[jira] [Commented] (NIFI-2613) Support extracting content from Microsoft Excel (.xlxs) documents

ASF GitHub Bot (JIRA) Thu, 16 Mar 2017 21:55:00 -0700

    [ 
https://issues.apache.org/jira/browse/NIFI-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929417#comment-15929417
 ]


ASF GitHub Bot commented on NIFI-2613:
--------------------------------------

Github user jvwing commented on the issue:

    https://github.com/apache/nifi/pull/929
  
    Thanks for those improvements, @jdye64. I especially like the updated usage 
doc.  Two things on the latest code:
    
    1. Did you try an .xls file?  There is a problem when the flowfile 
attribute is added in the catch block on line ~195.  The NiFi framework throws 
an exception of its own, because we can't call `session.putAttribute` inside an 
InputStreamCallback for the same flowfile:
    
    > java.lang.IllegalStateException: 
StandardFlowFileRecord[uuid=be192381-9475-4c6d-a6ca-43735e5df271,claim=StandardContentClaim
 [resourceClaim=StandardResourceClaim[id=1489713904793-1, container=default, 
section=1], offset=0, 
length=26112],offset=0,name=./conf/test-xls.xls,size=26112] already in use for 
an active callback or InputStream created by ProcessSession.read(FlowFile) has 
not been closed
    
    Something similar happens with the `session.putAttribute` on line ~209.  As 
a result of these exceptions, the session is rolled back and the flowfile is 
returned to the input queue.  I think we can throw an exception from inside the 
callback, though, so catching and rethrowing with a different error message 
should work out.
    
    2. In the failure case, we're routing the flowfile to both 'failure' and 
'original'.  I didn't realize it earlier, but I now believe this to be unusual 
in NiFi.  Most processors treat failure as an exclusive route, and 'original' 
as part of the successful happy path.  SplitAvro, SplitJson, SplitText, and 
UnpackContent were some examples I looked at.  I doubt that's written in stone. 
 What do you think?
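    
    To make both points concrete, here is a minimal, self-contained sketch of 
the pattern. It is not the code from the PR or from the fork: `FakeSession` and 
`ReadCallback` are hypothetical stand-ins for NiFi's `ProcessSession` and 
`InputStreamCallback`, the `parse.error` attribute name and the `parse` helper 
are invented for illustration, and the stand-in session only imitates the 
"already in use for an active callback" check. The callback rethrows with a 
clearer message, the attribute is added only after the read has returned, and 
exactly one relationship is chosen.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

public class CallbackAttributeSketch {

    /** Hypothetical stand-in for NiFi's InputStreamCallback. */
    interface ReadCallback {
        void process(InputStream in) throws IOException;
    }

    /** Hypothetical stand-in for ProcessSession: rejects attribute changes during a read. */
    static class FakeSession {
        final Map<String, String> attributes = new HashMap<>();
        private boolean reading = false;

        void read(byte[] content, ReadCallback callback) throws IOException {
            reading = true;
            try (InputStream in = new ByteArrayInputStream(content)) {
                callback.process(in);
            } finally {
                reading = false; // the read is over even if the callback threw
            }
        }

        void putAttribute(String key, String value) {
            if (reading) {
                throw new IllegalStateException("already in use for an active callback");
            }
            attributes.put(key, value);
        }
    }

    /** Returns the single relationship chosen: "failure" or "original", never both. */
    static String process(FakeSession session, byte[] content) {
        try {
            session.read(content, in -> {
                try {
                    parse(in);
                } catch (IOException e) {
                    // Rethrow with a clearer message; do not touch the session here.
                    throw new IOException("Could not parse document: " + e.getMessage(), e);
                }
            });
        } catch (IOException e) {
            // The read has finished, so the attribute change is allowed now.
            session.putAttribute("parse.error", e.getMessage());
            return "failure";
        }
        return "original";
    }

    /** Hypothetical parser: treats empty content as unparseable. */
    static void parse(InputStream in) throws IOException {
        if (in.read() == -1) {
            throw new IOException("empty document");
        }
    }

    public static void main(String[] args) {
        FakeSession session = new FakeSession();
        System.out.println(process(session, new byte[0]));
        System.out.println(session.attributes);
    }
}
```

    In the real processor the rethrown exception would surface from 
`session.read(...)`, and the catch around that call is where the attribute and 
the transfer to 'failure' would go. The exact exception type the framework 
raises there is not modeled in this sketch.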
    
    I made a [sample code 
fork](https://github.com/jvwing/nifi/commit/2ccf5dec2dcd707c5963716dfb3fbf7813c460ea)
 with a unit test for .xls and a suggested approach to solving the 
IllegalStateExceptions, and the failure routing.  I did not get the logging to 
cooperate the way I think it should, but we're not too far off.


> Support extracting content from Microsoft Excel (.xlxs) documents
> -----------------------------------------------------------------
>
>                 Key: NIFI-2613
>                 URL: https://issues.apache.org/jira/browse/NIFI-2613
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>            Reporter: Jeremy Dyer
>            Assignee: Jeremy Dyer
>
> Microsoft Excel is a wildly popular application that businesses rely heavily 
> on to store, visualize, and calculate data. Any single company most likely 
> has thousands of Excel documents containing data that could be very valuable 
> if ingested via NiFi and combined with other datasources. Apache POI is a 
> popular 100% Java library for parsing several Microsoft document formats 
> including Excel. Apache POI is extremely flexible and can do several things. 
> This issue would focus solely on using Apache POI to parse an incoming .xlsx 
> document and convert it to CSV. The processor should be capable of limiting 
> which Excel sheets are converted. CSV seems like the natural choice for 
> outputting each row 
> since this feature is already available in Excel and feels very natural to 
> most Excel sheet designs.
> This capability should most likely introduce a new "poi" module as I envision 
> many more capabilities around parsing Microsoft documents could come from 
> this base effort.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
