[GitHub] [incubator-pinot] timsants opened a new pull request #6046: Deep Extraction Support for ORC, Thrift, and ProtoBuf Records

GitBox Tue, 22 Sep 2020 10:17:10 -0700


timsants opened a new pull request #6046:
URL: https://github.com/apache/incubator-pinot/pull/6046



   ## Description
   1. PR for issue [#5507](https://github.com/apache/incubator-pinot/issues/5507). ORC, Thrift, and ProtoBuf readers now convert:
       - Nested structures to Map
       - Collection to Object[]
       - Number/String/bytebuffer to single value
   2. All extractors now support extracting all fields if `fieldsToRead` is null/empty (issue [#5677](https://github.com/apache/incubator-pinot/issues/5677)). This support was added to ORCRecordExtractor, ThriftRecordExtractor, ProtoBufRecordExtractor, and CSVRecordExtractor.
   3. Extractor util cleanup: there were duplicate implementations of the extractor converters across RecordReaderUtils, JsonRecordExtractorUtils, and AvroUtils. This PR adds a new method, `Object convert(Object value)`, to the RecordExtractor interface, since every extractor should implement it to convert each field of its file format. A new abstract class implementing RecordExtractor was created to hold the logic that was repeated across RecordReaderUtils, JsonRecordExtractorUtils, and AvroUtils. The abstract class also defines the common methods for recursively handling maps, collections, records, and single values.
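
The recursive conversion described in items 1 and 3 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class names `DeepConverterSketch` and `PlainValueConverter` and the hooks `isRecord` and `recordFields` are hypothetical, and the single-value rules are assumptions based on the description above.

```java
import java.nio.ByteBuffer;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the shared abstract extractor; all names are illustrative. */
abstract class DeepConverterSketch {

  /** Format-specific hook: is this value a nested record (struct or message)? */
  protected abstract boolean isRecord(Object value);

  /** Format-specific hook: expose a nested record's fields as name -> raw value. */
  protected abstract Map<String, Object> recordFields(Object record);

  /** Recursively converts a raw field value into a Map, an Object[], or a single value. */
  public Object convert(Object value) {
    if (value == null) {
      return null;
    }
    if (isRecord(value)) {
      return convertMap(recordFields(value));
    }
    if (value instanceof Map) {
      return convertMap((Map<?, ?>) value);
    }
    if (value instanceof Collection) {
      return convertCollection((Collection<?>) value);
    }
    return convertSingleValue(value);
  }

  protected Map<Object, Object> convertMap(Map<?, ?> map) {
    Map<Object, Object> converted = new LinkedHashMap<>();
    for (Map.Entry<?, ?> entry : map.entrySet()) {
      // Keys are treated as single values; only the values are converted recursively.
      converted.put(convertSingleValue(entry.getKey()), convert(entry.getValue()));
    }
    return converted;
  }

  protected Object[] convertCollection(Collection<?> collection) {
    Object[] converted = new Object[collection.size()];
    int i = 0;
    for (Object element : collection) {
      converted[i++] = convert(element);
    }
    return converted;
  }

  protected Object convertSingleValue(Object value) {
    if (value == null) {
      return null;
    }
    if (value instanceof ByteBuffer) {
      // Copy the remaining bytes without disturbing the caller's buffer position.
      ByteBuffer buffer = ((ByteBuffer) value).duplicate();
      byte[] bytes = new byte[buffer.remaining()];
      buffer.get(bytes);
      return bytes;
    }
    if (value instanceof Number || value instanceof byte[]) {
      return value;
    }
    return value.toString();
  }
}

/** Trivial subclass for plain Java values, where nothing is a format-specific record. */
final class PlainValueConverter extends DeepConverterSketch {
  @Override
  protected boolean isRecord(Object value) {
    return false;
  }

  @Override
  protected Map<String, Object> recordFields(Object record) {
    throw new UnsupportedOperationException("no record type in this sketch");
  }
}
```

A format-specific extractor would override the two hooks for its own record type (for example an ORC struct, a Thrift struct, or a ProtoBuf message) and reuse the map, collection, and single-value handling unchanged.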
   
   ## Release Notes
   **ORC Records**
   
   Before this PR:
   - All single value ORC types were converted to number/string/byte[]
   - List type as Object[]
   - Map type as Map<Object, Object>
   - There was no case for handling ORC struct types. An 
IllegalArgumentException would have been thrown if a struct type field was 
present.
   - Only 1 level of nesting was handled in Map and Array.
   
   After this PR:
   - All single value ORC types are converted to number/string/byte[]
   - List type as Object[]
   - Map type as Map<Object, Object>
   - ORC struct type as Map<Object, Object>
   - Nested extraction is supported for List, Map and Struct types. Only nested 
Map values are supported (keys are handled as a single value).
   
   **Thrift Records**
   
   Before this PR:
   - All single value Thrift types were converted to number/string/byte[]
   - List types as Object[] with only 1 level of nesting
   - Maps and Thrift structs were converted by calling `.toString()` on them, so nested object structures were not preserved.
   - It was assumed that the field IDs in a Thrift record were consecutive, but the Thrift compiler does not enforce this.
   
   After this PR:
   - All single value Thrift types are converted to number/string/byte[]
   - List types as Object[]
   - Map as Map<Object, Object>
   - TBase type (Thrift struct) as Map<Object, Object>
   - Nested extraction is supported for List, Map and Struct types. Only nested 
Map values are supported (keys are handled as a single value).
   - The initialization of fields now takes the field IDs from the structMetaDataMap of the Thrift object, so field IDs do not need to be consecutive.
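
To illustrate why the consecutive-ID assumption matters, here is a plain-Java sketch that does not use the Thrift library. The `metadata` map is a hypothetical stand-in for the struct metadata map (field ID to field name), and both method names are invented for this example; the first method shows the assumption, not the previous implementation itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FieldIdLookupSketch {

  /** Illustrates the old assumption: IDs run 1..n, so any ID outside that range is missed. */
  static Map<String, Integer> byConsecutiveIds(Map<Integer, String> metadata) {
    Map<String, Integer> fieldIds = new LinkedHashMap<>();
    for (int id = 1; id <= metadata.size(); id++) {
      String name = metadata.get(id);
      if (name != null) {
        fieldIds.put(name, id);
      }
    }
    return fieldIds;
  }

  /** Illustrates the approach described above: take the IDs from the metadata itself. */
  static Map<String, Integer> byMetadata(Map<Integer, String> metadata) {
    Map<String, Integer> fieldIds = new LinkedHashMap<>();
    for (Map.Entry<Integer, String> entry : metadata.entrySet()) {
      fieldIds.put(entry.getValue(), entry.getKey());
    }
    return fieldIds;
  }
}
```

With field IDs 1, 3, and 7, the first method only finds the fields with IDs 1 and 3, while the second finds all three.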
   
   **ProtoBuf Records**
   
   Before this PR:
   - All single value ProtoBuf types were converted to number/string/byte[]
   - Repeated type (array) as Object[] with only 1 level of nesting
   - Map types were incorrectly handled as a collection, and ProtoBuf Messages were converted by calling `.toString()` on them, so nested object structures were not preserved.
   
   After this PR:
   - All single value ProtoBuf types are converted to number/string/byte[]
   - Repeated type (array) as Object[]
   - Map as Map<Object, Object>
   - ProtoBuf nested messages as Map<String, Object>
   - Nested extraction is supported for List, Map and Struct types. Only nested 
Map values are supported (keys are handled as a single value).
   
   **Backwards incompatibility**
   With the new extraction support for nested fields/complex objects, if a Thrift, ProtoBuf or ORC file contains Map/Collection fields holding complex objects, those objects will now be retained instead of being converted with `.toString()`. Therefore, any client expecting the old treatment of nested fields will be impacted.
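
As a rough illustration of the client impact, the sketch below contrasts the two representations of a nested value and shows a consumer that tolerates both. All names are hypothetical and the code is plain Java, not Pinot code.

```java
import java.util.Map;

final class NestedValueCompatSketch {

  /** Old-style handling: the nested object is flattened to its string form. */
  static Object legacyValue(Object nested) {
    return nested.toString();
  }

  /** New-style handling: the nested structure is retained (identity here, for brevity). */
  static Object retainedValue(Object nested) {
    return nested;
  }

  /** A defensive consumer that works with either representation. */
  static String describe(Object value) {
    if (value instanceof Map) {
      return "map with " + ((Map<?, ?>) value).size() + " entries";
    }
    return "string: " + value;
  }
}
```

A consumer that casts such a value straight to `String` would work with the first representation and fail with a `ClassCastException` on the second.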
   
   In addition, if the `fieldsToRead` param is ever null/empty for the 
RecordReader, all fields of the record will now be read. Prior to this change, 
no field would have been read by the RecordReader.
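
The null/empty `fieldsToRead` behavior can be sketched as below. This is an assumption-laden illustration over a plain `Map` record; `FieldSelectionSketch` and `extract` are hypothetical names, not the RecordExtractor API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class FieldSelectionSketch {

  /** Copies the requested fields, or every field when fieldsToRead is null or empty. */
  static Map<String, Object> extract(Map<String, Object> record, Set<String> fieldsToRead) {
    boolean extractAll = fieldsToRead == null || fieldsToRead.isEmpty();
    Map<String, Object> row = new LinkedHashMap<>();
    if (extractAll) {
      row.putAll(record);
      return row;
    }
    for (String field : fieldsToRead) {
      // A requested field that is absent from the record maps to null.
      row.put(field, record.get(field));
    }
    return row;
  }
}
```

Callers that previously passed null or an empty set and relied on getting no fields back would now receive the full record.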
   




