swamirishi commented on a change in pull request #536:
URL: https://github.com/apache/incubator-sedona/pull/536#discussion_r686123990
##########
File path: core/src/main/java/org/apache/sedona/core/spatialRDD/SpatialRDD.java
##########
@@ -564,7 +588,45 @@ public void saveAsWKT(String outputLocation)
}
}).saveAsTextFile(outputLocation);
}
-
+
+    public void saveAsParquet(JavaSparkContext sc,
+                              String geometryColumn,
+                              List<Field> userColumns,
+                              String outputLocation,
+                              String namespace,
+                              String name) throws SedonaException, IOException {
+        sc.hadoopConfiguration().setBoolean("parquet.enable.summary-metadata", false);
+        final Job job = Job.getInstance(sc.hadoopConfiguration());
+        GeometryType geometryType = this.getGeometryType();
+        Schema schema = AvroUtils.getSchema(geometryType, geometryColumn, userColumns,namespace,
+                name);
+//        ParquetOutputFormat.setWriteSupportClass(job, AvroWriteSupport.class);
+        AvroParquetOutputFormat.setSchema(job, schema);
+        this.rawSpatialRDD.mapPartitionsToPair(new PairFlatMapFunction<Iterator<T>, Void, GenericRecord>() {
+            @Override
+            public Iterator<Tuple2<Void, GenericRecord>> call(Iterator<T> iterator) throws Exception {
+                Schema schema = AvroUtils.getSchema(geometryType, geometryColumn, userColumns, namespace,
+                        name);
+                Iterable<T> recordIterable = () -> iterator;
+                try {
+                    return StreamSupport.stream(recordIterable.spliterator(), false).map(geometry -> {
+                        try {
+                            GenericRecord genericRecord =
+                                    AvroUtils.getRecord(geometry, geometryType, geometryColumn, schema);
Review comment:
We won't be able to store the statistics required for predicate pushdown.
Parquet stores page-level aggregate statistics, and those are what we would
use when we have to do a predicate pushdown or a spatial join. Currently this
is Phase 1 (serialization and deserialization between Parquet and Geometry
objects). Predicate pushdown and join support would be part of Phase 2.
For example, these are the metadata statistics stored in the footer of a
Parquet file that stores polygon geometry data:
file created_by parquet-mr version 1.12.0 (build db75a6815f2ba1d1ee89d1a90aeb296f1f3a8f20)
file columns 5
file row_groups 1
file rows 1509
row_group 0 size 50818
row_group 0 rows 1509
row_group 0 columns 5
row_group 0 p.ex.array.x type DOUBLE
row_group 0 p.ex.array.x num_values 7545
row_group 0 p.ex.array.x compression UNCOMPRESSED
row_group 0 p.ex.array.x encodings RLE,PLAIN_DICTIONARY
row_group 0 p.ex.array.x compressed_size 25270
row_group 0 p.ex.array.x uncompressed_size 25270
row_group 0 p.ex.array.x stats:min -158.104182
row_group 0 p.ex.array.x stats:max -66.03575
row_group 0 p.ex.array.y type DOUBLE
row_group 0 p.ex.array.y num_values 7545
row_group 0 p.ex.array.y compression UNCOMPRESSED
row_group 0 p.ex.array.y encodings RLE,PLAIN_DICTIONARY
row_group 0 p.ex.array.y compressed_size 25406
row_group 0 p.ex.array.y uncompressed_size 25406
row_group 0 p.ex.array.y stats:min 17.986328
row_group 0 p.ex.array.y stats:max 47.777622
row_group 0 p.holes.array.array.x type DOUBLE
row_group 0 p.holes.array.array.x num_values 1509
row_group 0 p.holes.array.array.x compression UNCOMPRESSED
row_group 0 p.holes.array.array.x encodings RLE,PLAIN
row_group 0 p.holes.array.array.x compressed_size 38
row_group 0 p.holes.array.array.x uncompressed_size 38
row_group 0 p.holes.array.array.y type DOUBLE
row_group 0 p.holes.array.array.y num_values 1509
row_group 0 p.holes.array.array.y compression UNCOMPRESSED
row_group 0 p.holes.array.array.y encodings RLE,PLAIN
row_group 0 p.holes.array.array.y compressed_size 38
row_group 0 p.holes.array.array.y uncompressed_size 38
row_group 0 user_data type BYTE_ARRAY
row_group 0 user_data num_values 1509
row_group 0 user_data compression UNCOMPRESSED
row_group 0 user_data encodings PLAIN_DICTIONARY,BIT_PACKED
row_group 0 user_data compressed_size 66
row_group 0 user_data uncompressed_size 66
row_group 0 user_data stats:min testAttribute123
row_group 0 user_data stats:max testAttribute123
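
To illustrate how the min/max statistics above could drive a Phase 2
pushdown, here is a minimal sketch in plain Java. It is not part of this PR
and does not use any Parquet or Sedona API: RowGroupStats, Window,
mayIntersect and rowGroupsToRead are hypothetical names made up for the
example. It assumes the x/y min and max of the exterior ring columns are
treated as a bounding box for the whole row group, so a row group can be
skipped when that box cannot intersect the query window.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: prune row groups using min/max statistics of the x and y columns. */
final class RowGroupPruningSketch {

    /** Hypothetical holder for the x/y min and max read from one row group's footer stats. */
    static final class RowGroupStats {
        final int index;
        final double minX, maxX, minY, maxY;

        RowGroupStats(int index, double minX, double maxX, double minY, double maxY) {
            this.index = index;
            this.minX = minX;
            this.maxX = maxX;
            this.minY = minY;
            this.maxY = maxY;
        }
    }

    /** Hypothetical axis-aligned query window. */
    static final class Window {
        final double minX, maxX, minY, maxY;

        Window(double minX, double maxX, double minY, double maxY) {
            this.minX = minX;
            this.maxX = maxX;
            this.minY = minY;
            this.maxY = maxY;
        }
    }

    /** True if the row group's bounding box overlaps the window, so it still has to be read. */
    static boolean mayIntersect(RowGroupStats s, Window w) {
        return s.minX <= w.maxX && s.maxX >= w.minX
                && s.minY <= w.maxY && s.maxY >= w.minY;
    }

    /** Returns the indexes of the row groups that cannot be skipped for the given window. */
    static List<Integer> rowGroupsToRead(List<RowGroupStats> groups, Window w) {
        List<Integer> keep = new ArrayList<>();
        for (RowGroupStats s : groups) {
            if (mayIntersect(s, w)) {
                keep.add(s.index);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        // Min/max values copied from row_group 0 in the footer dump above.
        RowGroupStats rg0 = new RowGroupStats(0, -158.104182, -66.03575, 17.986328, 47.777622);

        Window insideExtent = new Window(-100.0, -90.0, 30.0, 40.0);
        Window outsideExtent = new Window(0.0, 10.0, 40.0, 50.0);

        System.out.println(mayIntersect(rg0, insideExtent));
        System.out.println(mayIntersect(rg0, outsideExtent));
    }
}
```

With the statistics of row_group 0, the first window overlaps the row
group's extent and the second does not, so only the first query would need
to read it. The check is conservative: an overlap only means the row group
may contain matching geometries, so the exact predicate still has to be
evaluated on the rows that are read.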