Re: [PR] added usecases,pointers to API [parquet-java]

via GitHub Sun, 11 Aug 2024 02:52:50 -0700


wgtmac commented on code in PR #2983:
URL: https://github.com/apache/parquet-java/pull/2983#discussion_r1712949042



##########
parquet-avro/README.md:
##########
@@ -44,3 +44,80 @@ Apache Avro integration
 | `parquet.avro.add-list-element-records` | `boolean` | Flag whether to assume 
that any repeated element in the schema is a list element.<br/>The default 
value is `true`. |
 | `parquet.avro.write-parquet-uuid`       | `boolean` | Flag whether to write 
the [Parquet UUID logical 
type](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#uuid)
 in case of an [Avro UUID 
type](https://avro.apache.org/docs/current/spec.html#UUID) is present.<br/>The 
default value is `false`. |
 | `parquet.avro.writeFixedAsInt96`    | `String` | Comma separated list of 
paths pointing to Avro schema elements which are to be converted to `INT96` 
Parquet types.<br/>The path is a `'.'` separated list of field names and does 
not contain the name of the schema nor the namespace. The type of the 
referenced schema elements must be `fixed` with the size of 12 
bytes.<br/>**NOTE: The `INT96` Parquet type is deprecated. This option is only 
to support old data.** |
+
+## Apache Parquet-Avro is a combination of two technologies:
+
+1. **Apache Parquet**: A columnar storage format for Hadoop and other big data 
platforms, optimized for querying large datasets.
+
+2. **Apache Avro**: A data serialization system that provides a compact, fast, 
and efficient way to serialize and deserialize data.
+
+## Major aspects of Apache Parquet-Avro:
+
+**Parquet aspects**:
+
+1. **Columnar storage**: Stores data in columns instead of rows, reducing 
storage and improving query performance.
+2. **Efficient compression**: Supports various compression algorithms, 
reducing storage and improving data transfer.
+3. **Schema evolution**: Allows for schema changes without rewriting existing 
data.
+4. **High-performance querying**: Optimized for fast querying and data 
retrieval.
+
+**Avro aspects**:
+
+1. **Data serialization**: Provides a compact, fast, and efficient way to 
serialize and deserialize data.
+2. **Schema-based**: Uses a schema to define data structures, ensuring data 
consistency and compatibility.
+3. **Language-independent**: Supports multiple programming languages, 
including Java, Python, and C++.
+4. **Rich data structures**: Supports complex data structures, including 
records, enums, and arrays.
+
+## Parquet-Avro integration:
+
+1. **Avro schema integration**: Uses Avro schemas to define Parquet data 
structures.
+2. **Efficient Avro data storage**: Stores Avro data in Parquet files, 
leveraging Parquet's columnar storage and compression.
+3. **Seamless data exchange**: Enables easy data exchange between Avro and 
Parquet formats.
+
+By combining Parquet's columnar storage and Avro's data serialization, 
Parquet-Avro provides a powerful and efficient way to store and query large 
datasets.
+
+## Specific  and practical applications of Apache Parquet-Avro
+
+1. **Data ingestion and processing**: Using Parquet-Avro to efficiently store 
and process large amounts of data from various sources, like logs, sensors, or 
social media.
+
+2. **Data transformation and aggregation**: Leveraging Parquet-Avro's columnar 
storage to perform fast data transformations and aggregations, like data 
warehousing or business intelligence workloads.
+
+3. **Real-time data analytics**: Utilizing Parquet-Avro's efficient storage 
and querying capabilities to power real-time data analytics, like fraud 
detection or live dashboards.
+
+4. **Machine learning data preparation**: Using Parquet-Avro to store and 
prepare large datasets for machine learning model training, feature 
engineering, and data augmentation.
+
+5. **Data archiving and compliance**: Employing Parquet-Avro's compression and 
encryption features to archive large datasets for long-term storage and 
compliance purposes.
+
+6. **Cloud data migration**: Migrating large datasets to cloud storage 
services, like Amazon S3 or Google Cloud Storage, using Parquet-Avro for 
efficient data transfer and storage.
+
+7. **Data integration and ETL**: Using Parquet-Avro as a common format for 
data integration and ETL (Extract, Transform, Load) processes, ensuring data 
consistency and compatibility.
+
+8. **IoT data processing and analytics**: Storing and processing large amounts 
of IoT sensor data using Parquet-Avro, enabling efficient data analytics and 
insights.
+
+## Parquet-Avro API Overview

Review Comment:
   Similarly, I don't think we need an API overview because the API is already 
covered by the publicly available Javadoc: 
https://javadoc.io/doc/org.apache.parquet/parquet-avro/latest/index.html



##########
parquet-avro/README.md:
##########
@@ -44,3 +44,80 @@ Apache Avro integration
 | `parquet.avro.add-list-element-records` | `boolean` | Flag whether to assume 
that any repeated element in the schema is a list element.<br/>The default 
value is `true`. |
 | `parquet.avro.write-parquet-uuid`       | `boolean` | Flag whether to write 
the [Parquet UUID logical 
type](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#uuid)
 in case of an [Avro UUID 
type](https://avro.apache.org/docs/current/spec.html#UUID) is present.<br/>The 
default value is `false`. |
 | `parquet.avro.writeFixedAsInt96`    | `String` | Comma separated list of 
paths pointing to Avro schema elements which are to be converted to `INT96` 
Parquet types.<br/>The path is a `'.'` separated list of field names and does 
not contain the name of the schema nor the namespace. The type of the 
referenced schema elements must be `fixed` with the size of 12 
bytes.<br/>**NOTE: The `INT96` Parquet type is deprecated. This option is only 
to support old data.** |
+
+## Apache Parquet-Avro is a combination of two technologies:
+
+1. **Apache Parquet**: A columnar storage format for Hadoop and other big data 
platforms, optimized for querying large datasets.
+
+2. **Apache Avro**: A data serialization system that provides a compact, fast, 
and efficient way to serialize and deserialize data.
+
+## Major aspects of Apache Parquet-Avro:

Review Comment:
   IMHO, we don't need these kinds of descriptions in the parquet-java 
repository. A better place would be the parquet site: 
https://parquet.apache.org/docs/overview/  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


