arenger commented on issue #3414: NIFI-5900 Added SplitLargeJson processor
URL: https://github.com/apache/nifi/pull/3414#issuecomment-481294396
 
 
   @ottobackwards I originally sought to improve `SplitJson` instead of adding 
a new processor.  I could certainly submit a different PR targeting an 
improvement to `SplitJson`, but there were a few reasons I thought a different 
processor might be better:
   
   1. The `SplitLargeJson` processor is designed to always output complete JSON 
documents.  This differs from the `SplitJson` behavior.  For example, when 
splitting an array of strings, `SplitJson` would output `String1`, `String2`, 
etc., but `SplitLargeJson` would output `["String1"]`, `["String2"]`, etc.  This 
can be advantageous when the output relation (the split-relation) is directed 
to another processor that expects JSON.
   2. The `SplitJson` processor can only split arrays.  The JSON Path must 
target an array in the document.  However, `SplitLargeJson` can split arrays 
_and_ objects.  If the JSON Path points to an object then it will output all 
the key-value pairs of that object in separate flowfiles.
   3. The `SplitJson` processor sets a `fragment.count` attribute on outgoing 
flowfiles to indicate the total number of documents that were split from the 
designated JSON Path.  That total cannot be known up front with a SAX-like 
(streaming) approach to reading the JSON, because the processor is designed to 
avoid loading the whole document into memory at once.  Therefore, in order to 
preserve the current behavior of `SplitJson`, a setting would need to be added 
to optionally engage the optimized handling for large files -- with a stated 
caveat that the `fragment.count` attribute would be unavailable in that mode.
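
To make the three differences above concrete, here is a minimal Python sketch (not the NiFi code; the function names are hypothetical). It assumes that an object split emits one single-pair object per key, which the comment does not spell out. For brevity it works on already-parsed values, whereas a real streaming splitter would emit fragments while still reading the input.

```python
import json
from typing import Any, Iterator


def split_bare(target: list) -> Iterator[str]:
    """SplitJson-style output as described above: strings come out bare."""
    for item in target:
        yield item if isinstance(item, str) else json.dumps(item)


def split_complete(target: Any) -> Iterator[str]:
    """SplitLargeJson-style output: every fragment is a complete JSON document.

    Arrays yield one-element arrays. Objects yield one-pair objects
    (assumed shape; the comment only says key-value pairs are split out).
    """
    if isinstance(target, list):
        for item in target:
            yield json.dumps([item])
    elif isinstance(target, dict):
        for key, value in target.items():
            yield json.dumps({key: value})
    else:
        raise TypeError("target must be a JSON array or object")


if __name__ == "__main__":
    for fragment in split_complete(["String1", "String2"]):
        print(fragment)
```

Because both functions are generators, a consumer sees each fragment before the total is known. That is the same reason a streaming split cannot stamp `fragment.count` on the first flowfile it emits.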
   
   Again, I could submit a different pull request that targets an optimization 
of `SplitJson` rather than an addition of a new `SplitLargeJson` processor.  I 
started down that path originally, with a boolean setting to optionally 
activate large-file processing (in that mode it could also split objects, 
provided the JSON Path was not "overly complex", i.e. did not require 
backtracking, etc.) -- but then I had to change the processor to occasionally 
output non-JSON documents, which made the code less elegant.  That said, I 
could see the value in sticking with one processor.
   
   As for JsonSurfer, I had honestly never heard of it.  My code here was from 
a work project I did a couple years ago that was finally approved for release 
to the public.  I could probably make a change to `SplitJson` that employs 
JsonSurfer... I'm bummed my code isn't as novel as I'd hoped, but I know that's 
how things go!
   
   Let me know what you think is best.
