[jira] [Commented] (NIFI-2613) Support extracting content from Microsoft Excel (.xlxs) documents

ASF GitHub Bot (JIRA) Tue, 18 Oct 2016 13:24:12 -0700

    [ 
https://issues.apache.org/jira/browse/NIFI-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586567#comment-15586567
 ]


ASF GitHub Bot commented on NIFI-2613:
--------------------------------------

Github user jvwing commented on the issue:

    https://github.com/apache/nifi/pull/929
  
    I have a few more comments from running the processor and reviewing the 
code:
    
    **Processor Annotations**
    * The CapabilityDescription says this processor transfers one flowfile for 
each sheet in the Excel document, but that was not my experience.  It 
aggregated all content into one output flowfile.  Is that the intended behavior?
    * Most NiFi attribute names are lowercase, e.g. "sheetname" or "sheet.name" 
rather than "SheetName".  I don't believe there is a hard-and-fast rule there, 
but I recommend going lowercase if you do not have a strong preference.
    
    **Properties**
    * PropertyDescriptors should have `name` as a machine-readable key like 
"extract-sheets" and `displayName` as the user-facing label, like "Sheets to 
Extract".
    
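    A sketch of what such a descriptor could look like, assuming the NiFi 
`PropertyDescriptor.Builder` API.  The holder class, the description text, and 
the validator choice are illustrative guesses, not taken from the PR:

```java
import org.apache.nifi.components.PropertyDescriptor;
import org.apache.nifi.processor.util.StandardValidators;

// Hypothetical holder class; in practice the constant would sit in the processor.
class SheetProperties {
    static final PropertyDescriptor SHEETS_TO_EXTRACT = new PropertyDescriptor.Builder()
            .name("extract-sheets")            // stable, machine-readable key
            .displayName("Sheets to Extract")  // label shown to the user
            .description("Names of the sheets to extract (illustrative wording).")
            .required(false)                   // assumption: the property is optional
            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
            .build();
}
```

    Keeping the key separate from the label means the label can be reworded 
later without changing the key.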
    **Exception Handling and Logging**
    Most NiFi components use `getLogger().error(...)`, as you do later in the 
code, rather than `printStackTrace()`.  I haven't tested these error cases, but 
they appear to be informational rather than stopping or failing the flow.  
Would `getLogger().info(...)` or `getLogger().debug(...)` be a better fit here?  
Do users need the full stack trace, or would just the message be helpful enough?
    ```
    } catch (InvalidFormatException e) {
        e.printStackTrace();
    } catch (OpenXML4JException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    }
    ```
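
    One possible shape for that, sketched with only the JDK so it runs on its 
own: `java.util.logging` stands in for the processor's `getLogger()`, and a JDK 
XML parse stands in for the POI call.  The class and method names are 
hypothetical.  The message goes out at an informational level and the stack 
trace only at debug level:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical stand-in class, not code from the PR.
class QuietParseLogging {
    // Stand-in for the logger a NiFi processor obtains from getLogger().
    private static final Logger LOG = Logger.getLogger(QuietParseLogging.class.getName());

    /** Returns true if the text parses as XML; logs and returns false otherwise. */
    static boolean tryParse(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // One multi-catch replaces three identical catch blocks.
            // The message is logged at INFO; the stack trace only at FINE (debug).
            LOG.log(Level.INFO, "Could not parse input: {0}", e.getMessage());
            LOG.log(Level.FINE, "Parse failure details", e);
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParse("<sheet><row/></sheet>"));
        System.out.println(tryParse("<sheet>"));
    }
}
```

    With the real POI types, note that `InvalidFormatException` is a subclass 
of `OpenXML4JException`, so a multi-catch there would list only the parent 
alongside `SAXException`.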



> Support extracting content from Microsoft Excel (.xlxs) documents
> -----------------------------------------------------------------
>
>                 Key: NIFI-2613
>                 URL: https://issues.apache.org/jira/browse/NIFI-2613
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>            Reporter: Jeremy Dyer
>            Assignee: Jeremy Dyer
>
> Microsoft Excel is a wildly popular application that businesses rely heavily 
> on to store, visualize, and calculate data. Any single company most likely 
> has thousands of Excel documents containing data that could be very valuable 
> if ingested via NiFi and combined with other datasources. Apache POI is a 
> popular 100% Java library for parsing several Microsoft document formats 
> including Excel. Apache POI is extremely flexible and can do several things. 
> This issue would focus solely on using Apache POI to parse an incoming .xlsx 
> document and convert it to CSV. The processor should be capable of limiting 
> which Excel sheets are extracted. CSV seems like the natural choice for 
> outputting each row, since CSV export is already available in Excel and fits 
> most Excel sheet designs.
> This capability should most likely introduce a new "poi" module, as I envision 
> that many more capabilities around parsing Microsoft documents could build on 
> this base effort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
