I'm working on a data lake solution for an IoT framework that does 44 kHz
data acquisition for a few dozen sensors (~990,000 measurements/second), and
I would like suggestions on how to build an efficient data ingestion solution
using Java 11+, Apache Arrow and Apache Parquet.

For data ingestion I am currently using the AvroParquetWriter
implementation from https://github.com/apache/parquet-mr, and I would like to
partition the dataset by two fields: timestamp and sensor name.
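For reference, this is roughly what my current (non-partitioned) writer looks
like; the schema and field names below are simplified placeholders, not my real
schema:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class SensorWriterExample {
    public static void main(String[] args) throws Exception {
        // Placeholder schema: one record per measurement.
        Schema schema = SchemaBuilder.record("Measurement").fields()
                .requiredString("sensor_name")
                .requiredLong("timestamp")   // e.g. epoch microseconds
                .requiredDouble("value")
                .endRecord();

        // Single output file, no partitioning yet.
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("/data/lake/measurements.parquet"))
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            GenericRecord rec = new GenericData.Record(schema);
            rec.put("sensor_name", "sensor-01");
            rec.put("timestamp", System.currentTimeMillis() * 1000L);
            rec.put("value", 0.42);
            writer.write(rec);
        }
    }
}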

I'm not finding examples of creating partitioned datasets with this API.

I can switch away from the Parquet file write API if needed. The solution does
not need to support distributed/clustered processing; it just needs to separate
the partitions into different directories on the local filesystem (a rough
sketch of what I have in mind follows).
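Something like the class below is what I am picturing: keep one writer per
partition and build Hive-style directory names (sensor_name=.../date=...),
which I believe DataFusion can pick up as partition columns. This is only a
hypothetical sketch on my part; the class name, layout and "part-0.parquet"
file naming are my own assumptions, not an existing API:

import java.util.HashMap;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

// Hypothetical sketch: one ParquetWriter per Hive-style partition directory,
// e.g. /data/lake/sensor_name=sensor-01/date=2024-05-01/part-0.parquet
public class PartitionedWriter implements AutoCloseable {
    private final String baseDir;
    private final Schema schema;
    private final Map<String, ParquetWriter<GenericRecord>> writers = new HashMap<>();

    public PartitionedWriter(String baseDir, Schema schema) {
        this.baseDir = baseDir;
        this.schema = schema;
    }

    public void write(String sensorName, String date, GenericRecord record) throws Exception {
        String partition = "sensor_name=" + sensorName + "/date=" + date;
        ParquetWriter<GenericRecord> writer = writers.get(partition);
        if (writer == null) {
            // Parent directories are created by the Hadoop local FileSystem.
            Path file = new Path(baseDir + "/" + partition + "/part-0.parquet");
            writer = AvroParquetWriter.<GenericRecord>builder(file)
                    .withSchema(schema)
                    .withCompressionCodec(CompressionCodecName.SNAPPY)
                    .build();
            writers.put(partition, writer);
        }
        writer.write(record);
    }

    @Override
    public void close() throws Exception {
        for (ParquetWriter<GenericRecord> w : writers.values()) {
            w.close();
        }
    }
}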

By the way, I currently use DataFusion to query the datasets written by
AvroParquetWriter.

Regards

João Antonio
