[ 
https://issues.apache.org/jira/browse/NIFI-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929143#comment-15929143
 ] 

ASF GitHub Bot commented on NIFI-2613:
--------------------------------------

Github user jvwing commented on the issue:

    https://github.com/apache/nifi/pull/929
  
    @jdye64 Thanks, I looked into some of these error cases:
    
    * Unsupported .xls - Throws an exception and routes to 'original'.  The
      error bulletin seems a bit terse:
    > ConvertExcelToCSVProcessor[id=43426b41-015a-1000-b06e-3a9be79162d1]
    > Package should contain a content type part [M1.13]
    * Blank Sheet - Succeeds, sending one empty flowfile to 'success' and the
      original to 'original'.  No errors.  Seems correct to me.
    * Empty Workbook - I wasn't able to create an empty workbook manually in
      Excel.
    * Unmatched Sheet Names - Covered by testNonExistantSpecifiedSheetName,
      seems correct to me.
    * Diverse Content - I tried a number of bizarre things Excel lets you put
      in spreadsheets: images, tables, pivot tables, formulas, hidden sheets,
      etc.  The processor returned content where appropriate (cells in tables
      and pivot tables, last computed formula value) and did not run into
      errors.  Images were ignored.
    * CSV-Breaking Content - The processor does not escape text to enforce a
      CSV structure.  Sheets containing multiline text in a cell, cells with
      commas, etc. produced malformed CSV files.  I do not object, I
      understand that escaping would be a significant scope increase.
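
For reference, the kind of escaping discussed under "CSV-Breaking Content" can be sketched as follows. This is only an illustration, assuming RFC 4180-style quoting (wrap a field in double quotes when it contains a comma, a double quote, or a line break, and double any embedded quotes). The class and method names are hypothetical and are not part of the processor in this PR.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the NiFi processor under review.
public class CsvEscapeSketch {

    // Quote a single cell value only when it would otherwise break the CSV structure.
    public static String escapeField(String value) {
        if (value == null) {
            return "";
        }
        boolean needsQuoting = value.contains(",")
                || value.contains("\"")
                || value.contains("\n")
                || value.contains("\r");
        if (!needsQuoting) {
            return value;
        }
        // Embedded double quotes are doubled, then the whole field is wrapped in quotes.
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Join already-extracted cell values into one CSV record.
    public static String toCsvRow(List<String> cells) {
        return cells.stream()
                .map(CsvEscapeSketch::escapeField)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(toCsvRow(Arrays.asList("plain", "a,b", "say \"hi\"", "line1\nline2")));
    }
}
```

A multiline cell stays inside one quoted field, so a reader that follows the same quoting rules sees one record per spreadsheet row.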
    
    I recommend that you do the following:
    
    1. Clearly document that the processor only supports .xlsx and NOT .xls
       files.  I know you've already answered questions about this on the dev
       list, so somebody's going to try it.
    2. For the .xls case, would it be possible to catch the
       `org.apache.poi.openxml4j.exceptions.InvalidFormatException`, repackage
       it with a helpful error message explaining that only well-formed XLSX
       files are accepted, and route to failure?
    3. Document that the processor does not escape invalid CSV content.
    4. Make the logging changes from the code review post above.
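
One way to produce the more helpful message asked for in item 2 is sketched below. Note that this is a different technique from catching `InvalidFormatException`: it inspects the first bytes of the content instead. It relies on two file-format facts: XLSX files are ZIP archives, which normally begin with the bytes `PK\x03\x04`, while legacy .xls files are OLE2 compound documents, which begin with `D0 CF 11 E0 A1 B1 1A E1`. The class name, method name, and message wording are hypothetical, and nothing here uses the NiFi or POI APIs. The resulting text could be logged from the processor's error handling before routing to failure.

```java
// Hypothetical helper, not part of the NiFi processor under review.
public class ExcelFormatSniffSketch {

    // Signature of an OLE2 compound document (used by legacy .xls files).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature of a ZIP archive (the container used by .xlsx).
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    private static boolean startsWith(byte[] content, byte[] prefix) {
        if (content == null || content.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (content[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns null when the content looks like a ZIP container (a candidate XLSX),
    // otherwise a human-readable explanation suitable for an error bulletin.
    public static String explainRejection(byte[] content) {
        if (startsWith(content, ZIP_MAGIC)) {
            return null;
        }
        if (startsWith(content, OLE2_MAGIC)) {
            return "Content looks like an OLE2 document (for example a legacy .xls file); "
                    + "only well-formed .xlsx files are accepted.";
        }
        return "Content is not a ZIP-based .xlsx file; only well-formed .xlsx files are accepted.";
    }

    public static void main(String[] args) {
        byte[] legacyHeader = {
            (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1, 0, 0
        };
        System.out.println(explainRejection(legacyHeader));
    }
}
```

A ZIP signature only shows that the container is plausible, so a file that passes this check can still fail inside POI, and catching the exception remains necessary.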


> Support extracting content from Microsoft Excel (.xlsx) documents
> -----------------------------------------------------------------
>
>                 Key: NIFI-2613
>                 URL: https://issues.apache.org/jira/browse/NIFI-2613
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>            Reporter: Jeremy Dyer
>            Assignee: Jeremy Dyer
>
> Microsoft Excel is a wildly popular application that businesses rely heavily
> on to store, visualize, and calculate data. Any single company most likely
> has thousands of Excel documents containing data that could be very valuable
> if ingested via NiFi and combined with other data sources. Apache POI is a
> popular 100% Java library for parsing several Microsoft document formats,
> including Excel. Apache POI is extremely flexible and can do many things, but
> this issue focuses solely on using it to parse an incoming .xlsx document
> and convert it to CSV. The processor should be capable of limiting which
> Excel sheets are converted. CSV seems like the natural choice for outputting
> each row, since Excel already supports CSV export and the format fits most
> Excel sheet designs.
> This capability should most likely introduce a new "poi" module, as I
> envision many more capabilities around parsing Microsoft documents coming
> from this base effort.



