>From Wail Alkowaileet <[email protected]>: Attention is currently required from: [email protected]. Wail Alkowaileet has posted comments on this change. ( https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209 )
Change subject: [WIP] Support COPY TO in parquet ...................................................................... Patch Set 30: (19 comments) File asterixdb/asterix-common/src/main/resources/asx_errormsg/en.properties: https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209/comment/53d9e4de_ddeb6f11 PS30, Line 307: Units unit https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209/comment/db93f1b0_788f5ae5 PS30, Line 307: , given . Provided https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209/comment/d535f847_8ea27207 PS30, Line 308: Unsupported compression scheme Remove and use 1096 File asterixdb/asterix-external-data/src/main/java/org/apache/asterix/external/util/ExternalDataConstants.java: https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209/comment/32ef9cae_a4fd7021 PS30, Line 87: 1MB Something could be off here. Both page size and row group size are the same? https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209/comment/6739fe02_2b9a9121 PS30, Line 315: public static final String KEY_COMPRESSION_LZO = "lzo"; : public static final String KEY_COMPRESSION_LZ4_RAW = "lz4_raw"; : public static final String KEY_COMPRESSION_BROTLI = "brotli"; Can you do a simple benchmark that compares both the size and write throughput? If I remember correctly, zstd and snappy give the best compression and time. *Can be done later https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209/comment/05282866_728771bb PS30, Line 331: JSON_WRITER_SUPPORTED_COMPRESSION TEXTUAL_WRITER_SUPPORTED_COMPRESSION 'Those' will be used for CSV as well File asterixdb/asterix-external-data/src/main/java/org/apache/asterix/external/util/WriterValidationUtil.java: https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209/comment/4eb4c84a_5f2e0393 PS30, Line 115: validateJSONCompression validateTextualCompression File asterixdb/asterix-om/pom.xml: https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209/comment/367fa1de_01cb992b PS30, Line 163: <dependency> Fix indentation File asterixdb/asterix-om/src/main/java/org/apache/asterix/om/pointables/printer/parquet/ParquetRecordLazyVisitor.java: https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209/comment/c154fefa_39728e4c PS30, Line 55: ARecordType This will fail if it is ANY. Because ANY isn't of type ARecordType https://github.com/apache/asterixdb/blob/6d124481c4238b59b754eec83811c28596bb6d29/asterixdb/asterix-om/src/main/java/org/apache/asterix/om/types/BuiltinType.java#L970 https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209/comment/7d257cdf_73fe1cbd PS30, Line 82: throw new HyracksDataException(ErrorCode.TUPLE_DOES_NOT_AGREE_WITH_GIVEN_SCHEMA You cannot assume any exception is a typing error. For example, what if getOrCreateFieldNameIndex() is the one that throws an exception? Also, the message does not tell me anything (I have no clue what's the offending type). As a user, I would need more information about the offending type and what the actual data type is. You need to make sure 'Type type' is a group type and it corresponds to an object. Take a look at line 90. It is exactly the same as this line (i.e., 67). How do you know if 'type' corresponds to an object or an array? You need to more digging in 'type' to ensure that the type is an array or object See: https://github.com/apache/asterixdb/blob/6d124481c4238b59b754eec83811c28596bb6d29/asterixdb/asterix-external-data/src/main/java/org/apache/asterix/external/input/record/reader/hdfs/parquet/AsterixTypeToParquetTypeVisitor.java#L195 https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209/comment/94d36201_2a7b01e5 PS30, Line 90: asGroupType See the comment above File asterixdb/asterix-om/src/main/java/org/apache/asterix/om/pointables/printer/parquet/ParquetRecordVisitorUtils.java: https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209/comment/ecb8211d_4b436765 PS30, Line 60: case BOOLEAN: : case BINARY: : case FIXED_LEN_BYTE_ARRAY: : case INT96: remove. 'default' suffices. https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209/comment/a31913e3_e807657d PS30, Line 65: HyracksDataException throw RuntimeDataException.create(ErrorCode.TYPE_MISMATCH_GENERIC, ATypeTag, primitiveTypeName) https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209/comment/331bff3c_9e30dc57 PS30, Line 176: switch (primitiveTypeName) { : case BOOLEAN: : recordConsumer.addBoolean(booleanValue); : break; : case BINARY: : case INT32: : case INT64: : case FLOAT: : case DOUBLE: : case FIXED_LEN_BYTE_ARRAY: : case INT96: : default: : throw new HyracksDataException( : "Typecast impossible from " + typeTag + " to " + primitiveTypeName); : } Replace with if. if(!BOOLEAN.eqauls(primitiveTypeName) { throw .... } boolean booleanValue = ABooleanSerializerDeserializer.getBoolean(b, s); recordConsumer.addBoolean(booleanValue); https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209/comment/69503f04_82051ef5 PS30, Line 200: case BINARY: : case UUID: : case POINT: : case DURATION: : case POINT3D: : case ARRAY: : case MULTISET: : case OBJECT: : case SPARSOBJECT: : case UNION: : case ENUM: : case TYPE: : case ANY: : case LINE: : case POLYGON: : case CIRCLE: : case RECTANGLE: : case INTERVAL: : case SYSTEM_NULL: : case YEARMONTHDURATION: : case DAYTIMEDURATION: : case SHORTWITHOUTTYPEINFO: remove https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209/comment/084016dd_93e22e1d PS30, Line 222: NULL We should be able to handle NULL if the schema said the value is optional. https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209/comment/86baa3e7_7ab0ddde PS30, Line 223: case GEOMETRY: : case UINT8: : case UINT16: : case UINT32: : case UINT64: : case BITARRAY: remove https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209/comment/6c76dedf_9814a3b3 PS30, Line 229: MISSING Could treated as NULL https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209/comment/b35c37cb_bc7ed7f7 PS30, Line 230: case DATETIME: remove -- To view, visit https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/18209 To unsubscribe, or for help writing mail filters, visit https://asterix-gerrit.ics.uci.edu/settings Gerrit-Project: asterixdb Gerrit-Branch: master Gerrit-Change-Id: I40dc16969e66af09cde04b460f441af666b39d51 Gerrit-Change-Number: 18209 Gerrit-PatchSet: 30 Gerrit-Owner: [email protected] Gerrit-Reviewer: Anon. E. Moose #1000171 Gerrit-Reviewer: Jenkins <[email protected]> Gerrit-Reviewer: [email protected] Gerrit-CC: Wail Alkowaileet <[email protected]> Gerrit-Attention: [email protected] Gerrit-Comment-Date: Mon, 29 Apr 2024 19:56:33 +0000 Gerrit-HasComments: Yes Gerrit-Has-Labels: No Gerrit-MessageType: comment
