jbs-atolcd opened a new pull request, #6186:
URL: https://github.com/apache/hop/pull/6186

   … 
   
   This commit introduces significant improvements to the Parquet Input and 
Output transforms by implementing comprehensive support for Parquet's Logical 
Types.
   
   Previously, the transforms relied primarily on primitive types, leading to 
conversion issues and data errors when handling complex types such as 
timestamps.
   
   Key Changes & Features:
   
   1. Parquet Input:
   * Logical Type Mapping: Refactors field discovery to use 
`LogicalTypeAnnotation` (instead of only the primitive type), enabling correct 
mapping for semantic types.
   * Timestamp/Date Precision: Implements a conversion mechanism to map 
Parquet's timestamp units (MILLIS, MICROS, ...) to Hop's `TYPE_TIMESTAMP` and 
`TYPE_DATE`, preserving precision and handling UTC adjustments.
   * JSON Support: Adds explicit support for the JSON Logical Type, converting 
the Parquet binary/string data into Hop's `TYPE_JSON` object.
   * Decimal Handling: Uses precision and scale from 
`DecimalLogicalTypeAnnotation` to correctly convert binary/long Parquet 
decimals into Hop's `TYPE_BIGNUMBER`.
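
To illustrate the timestamp and decimal conversions described above, here is a minimal sketch that uses only the Java standard library. It is not the code from this pull request: the class `ParquetLogicalReadSketch`, its `Unit` enum, and all method names are hypothetical, and the real transform reads the unit, the UTC flag, and the scale from the Parquet `LogicalTypeAnnotation` rather than taking them as plain arguments.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// Hypothetical helper for illustration only; not part of Hop or parquet-java.
final class ParquetLogicalReadSketch {

  // Stand-in for the time units a Parquet TIMESTAMP annotation can carry.
  enum Unit { MILLIS, MICROS, NANOS }

  // Converts a raw INT64 timestamp to an Instant, keeping sub-millisecond digits.
  // floorDiv/floorMod keep the nanosecond part non-negative for pre-1970 values.
  static Instant toInstant(long raw, Unit unit) {
    switch (unit) {
      case MILLIS:
        return Instant.ofEpochMilli(raw);
      case MICROS:
        return Instant.ofEpochSecond(
            Math.floorDiv(raw, 1_000_000L), Math.floorMod(raw, 1_000_000L) * 1_000L);
      case NANOS:
        return Instant.ofEpochSecond(
            Math.floorDiv(raw, 1_000_000_000L), Math.floorMod(raw, 1_000_000_000L));
      default:
        throw new IllegalArgumentException("Unknown unit: " + unit);
    }
  }

  // For timestamps that are NOT adjusted to UTC, the raw value encodes a
  // wall-clock reading, so it is decoded without applying any zone shift.
  static LocalDateTime toLocalWallClock(long raw, Unit unit) {
    return LocalDateTime.ofInstant(toInstant(raw, unit), ZoneOffset.UTC);
  }

  // DECIMAL stored as INT32/INT64: the integer is the unscaled value.
  static BigDecimal decimalFromLong(long unscaled, int scale) {
    return BigDecimal.valueOf(unscaled, scale);
  }

  // DECIMAL stored as binary: big-endian two's-complement unscaled value.
  static BigDecimal decimalFromBinary(byte[] unscaledBytes, int scale) {
    return new BigDecimal(new BigInteger(unscaledBytes), scale);
  }
}
```

For example, `decimalFromLong(12345L, 2)` yields `123.45`, and a MICROS value of `1700000000123456` keeps its `456` microseconds instead of being truncated to milliseconds.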
   
   2. Parquet Output:
   * Date/Timestamp Consistency: Ensures that Hop's `TYPE_DATE` and 
`TYPE_TIMESTAMP` are consistently converted to a `LONG` representation with the 
Parquet `timestampMillis` logical annotation, which is the most compatible 
format.
   * Schema Mapping: Maps Hop's `TYPE_JSON` and `TYPE_UUID` to Parquet `STRING` 
types in the schema definition.
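
The output-side conversions can be sketched the same way. Again, this is only an illustration under stated assumptions: `ParquetLogicalWriteSketch` and its methods are hypothetical names, `java.time.Instant` stands in for Hop's timestamp value, and the actual schema building and writing in the transform goes through the Parquet library, which is not shown here.

```java
import java.time.Instant;
import java.util.Date;
import java.util.UUID;

// Hypothetical helper for illustration only; not part of Hop or parquet-java.
final class ParquetLogicalWriteSketch {

  // A date becomes the INT64 payload of a millisecond timestamp:
  // milliseconds since 1970-01-01T00:00:00Z.
  static long dateToTimestampMillis(Date value) {
    return value.getTime();
  }

  // A higher-precision timestamp is reduced to milliseconds; any
  // microsecond or nanosecond digits are dropped by toEpochMilli().
  static long instantToTimestampMillis(Instant value) {
    return value.toEpochMilli();
  }

  // A UUID is written as its canonical 36-character text form.
  static String uuidToString(UUID value) {
    return value.toString();
  }
}
```

Note the trade-off this makes visible: writing everything as milliseconds favors compatibility, but a value such as `Instant.ofEpochSecond(1700000000L, 123456789L)` is stored as `1700000000123`, so sub-millisecond digits do not survive a write.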
   
   Testing and Validation:
   * Test Data Enrichment: The test dataset (`golden-parquet-input.json`) was 
extended to include new fields: `isActive` (Boolean), `registrationTimestamp` 
(Timestamp), and `metadataJson` (JSON), ensuring the new types are covered 
end-to-end.
   * Unit Test Update: The unit test configuration (`0029-parquet-input 
UNIT.json`) was updated to map and validate the new fields, confirming the 
correct functionality of the transform.
   
   This resolves a major limitation regarding data fidelity when dealing with 
common modern Parquet schemas.
   
   ------------------------
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   - [x] Run `mvn clean install apache-rat:check` to make sure basic checks 
pass. A more thorough check will be performed on your pull request 
automatically.
   - [ ] If you have a group of commits related to the same change, please 
squash your commits into one and force push your branch using `git rebase -i`.
   - [x] Mention the appropriate issue in your description (for example: 
`addresses #123`), if applicable.
   
   To make clear that you license your contribution under the [Apache License 
Version 2.0, January 2004](http://www.apache.org/licenses/LICENSE-2.0)
   you have to acknowledge this by using the following check-box.
   
   - [x] I hereby declare this contribution to be licensed under the [Apache 
License Version 2.0, January 2004](http://www.apache.org/licenses/LICENSE-2.0)
   - [ ] In any other case, please file an [Apache Individual Contributor 
License Agreement](https://www.apache.org/licenses/icla.pdf).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
