josephglanville commented on issue #5492: Native parallel batch indexing 
without shuffle
URL: https://github.com/apache/incubator-druid/pull/5492#issuecomment-407996093
 
 
   @jihoonson this is outside the scope of this PR, but would you be 
interested in collaborating on making the current layered Firehose abstractions 
support non-textual formats better?
   
   The backstory is that we mostly use InputRowParsers that operate on 
ByteBuffers, reading either raw byte messages from Kafka or SequenceFiles 
from archival storage (GCS in our case).
   
   We would like to modify or extend the current prefetching and iterating 
abstractions in two ways: to support iteration over file formats other than 
newline-delimited text and, most importantly, to support emitting non-string 
rows for parsing, so that ByteBufferInputRowParsers can be used with native 
batch ingestion.
   
   In my mind there is a missing abstraction layer that should handle creating 
an iterator from a file. That iterator returns rows that can then be passed to 
InputRowParsers.
   Basically, this would be an InputFileFormat interface. The current 
implementation would become TextFileInputFormat, and we would want a 
SequenceFileInputFormat, but any iterable file format would be possible.
   This separates the concern of reading rows out of the files from the 
Firehose, which should be responsible for connecting to storage and fetching 
files.
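   
   A rough sketch of the layer I have in mind follows. Everything in it is 
hypothetical: none of these types exist in Druid today, the method signature is 
only a guess at what would work, and a simple length-prefixed record format 
stands in for SequenceFile so that the example needs nothing beyond the JDK.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical interface: turns one already-fetched file into an iterator of
// raw rows of type T. The Firehose would only fetch files and hand over streams.
interface InputFileFormat<T> {
  Iterator<T> read(InputStream in) throws IOException;
}

// Hypothetical wrapper for the existing behaviour: one String row per line.
class TextFileInputFormat implements InputFileFormat<String> {
  @Override
  public Iterator<String> read(InputStream in) {
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    return reader.lines().iterator();
  }
}

// Stand-in for a binary container format (NOT the real SequenceFile layout):
// each record is a 4-byte big-endian length followed by that many payload bytes.
class LengthPrefixedInputFormat implements InputFileFormat<ByteBuffer> {
  @Override
  public Iterator<ByteBuffer> read(InputStream in) {
    DataInputStream data = new DataInputStream(in);
    return new Iterator<ByteBuffer>() {
      private ByteBuffer pending = advance();

      // Reads the next record, or returns null once the stream is exhausted.
      private ByteBuffer advance() {
        try {
          int length;
          try {
            length = data.readInt();
          } catch (EOFException e) {
            return null; // no further length prefix: treat as end of file
          }
          byte[] payload = new byte[length];
          data.readFully(payload);
          return ByteBuffer.wrap(payload);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      }

      @Override
      public boolean hasNext() {
        return pending != null;
      }

      @Override
      public ByteBuffer next() {
        if (pending == null) {
          throw new NoSuchElementException();
        }
        ByteBuffer row = pending;
        pending = advance();
        return row;
      }
    };
  }
}

class InputFileFormatDemo {
  public static void main(String[] args) throws IOException {
    // The byte array plays the role of a file the Firehose has already fetched.
    InputStream file =
        new ByteArrayInputStream("row1\nrow2\n".getBytes(StandardCharsets.UTF_8));
    Iterator<String> rows = new TextFileInputFormat().read(file);
    while (rows.hasNext()) {
      System.out.println(rows.next()); // each row would go to an InputRowParser
    }
  }
}
```

   The point of the type parameter is that a String-based parser and a 
ByteBuffer-based parser can sit behind the same fetching and prefetching code. 
Only the format implementation changes.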
   
   We could of course sidestep this by creating a custom Firehose that simply 
implements the exact logic we want and handles prefetching etc. without 
using the existing interfaces, but we would much prefer to upstream an 
approach that enables batch processing for everyone who wants to process 
non-textual formats.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

