swamirishi commented on a change in pull request #536:
URL: https://github.com/apache/incubator-sedona/pull/536#discussion_r686123990



##########
File path: core/src/main/java/org/apache/sedona/core/spatialRDD/SpatialRDD.java
##########
@@ -564,7 +588,45 @@ public void saveAsWKT(String outputLocation)
             }
         }).saveAsTextFile(outputLocation);
     }
-
+    
+    public void saveAsParquet(JavaSparkContext sc,
+                              String geometryColumn,
+                              List<Field> userColumns,
+                              String outputLocation,
+                              String namespace,
+                              String name) throws SedonaException, IOException {
+        sc.hadoopConfiguration().setBoolean("parquet.enable.summary-metadata", false);
+        final Job job = Job.getInstance(sc.hadoopConfiguration());
+        GeometryType geometryType = this.getGeometryType();
+        Schema schema = AvroUtils.getSchema(geometryType, geometryColumn, userColumns,namespace,
+                                            name);
+//        ParquetOutputFormat.setWriteSupportClass(job, AvroWriteSupport.class);
+        AvroParquetOutputFormat.setSchema(job, schema);
+        this.rawSpatialRDD.mapPartitionsToPair(new PairFlatMapFunction<Iterator<T>, Void, GenericRecord>() {
+            @Override
+            public Iterator<Tuple2<Void, GenericRecord>> call(Iterator<T> iterator) throws Exception {
+                Schema schema = AvroUtils.getSchema(geometryType, geometryColumn, userColumns, namespace,
+                                                    name);
+                Iterable<T> recordIterable = () -> iterator;
+                try {
+                    return StreamSupport.stream(recordIterable.spliterator(), false).map(geometry -> {
+                        try {
+                            GenericRecord genericRecord =
+                                    AvroUtils.getRecord(geometry, geometryType, geometryColumn, schema);

Review comment:
       We won't be able to store the statistics required for predicate pushdown. Parquet stores page-level aggregate statistics, such as the min/max values shown below, and we will need them when we do predicate pushdown or a spatial join. This PR is Phase 1 (serialization and deserialization between Parquet and Geometry objects). Predicate pushdown and join support would be part of Phase 2.
   E.g., these are the metadata statistics stored in the footer of a Parquet file that holds polygon geometry data:
   file created_by      parquet-mr version 1.12.0 (build db75a6815f2ba1d1ee89d1a90aeb296f1f3a8f20)
   file columns 5
   file row_groups      1
   file rows    1509
   row_group    0               size    50818
   row_group    0               rows    1509
   row_group    0               columns 5
   row_group    0       p.ex.array.x    type    DOUBLE
   row_group    0       p.ex.array.x    num_values      7545
   row_group    0       p.ex.array.x    compression     UNCOMPRESSED
   row_group    0       p.ex.array.x    encodings       RLE,PLAIN_DICTIONARY
   row_group    0       p.ex.array.x    compressed_size 25270
   row_group    0       p.ex.array.x    uncompressed_size       25270
   row_group    0       p.ex.array.x    stats:min       -158.104182
   row_group    0       p.ex.array.x    stats:max       -66.03575
   row_group    0       p.ex.array.y    type    DOUBLE
   row_group    0       p.ex.array.y    num_values      7545
   row_group    0       p.ex.array.y    compression     UNCOMPRESSED
   row_group    0       p.ex.array.y    encodings       RLE,PLAIN_DICTIONARY
   row_group    0       p.ex.array.y    compressed_size 25406
   row_group    0       p.ex.array.y    uncompressed_size       25406
   row_group    0       p.ex.array.y    stats:min       17.986328
   row_group    0       p.ex.array.y    stats:max       47.777622
   row_group    0       p.holes.array.array.x   type    DOUBLE
   row_group    0       p.holes.array.array.x   num_values      1509
   row_group    0       p.holes.array.array.x   compression     UNCOMPRESSED
   row_group    0       p.holes.array.array.x   encodings       RLE,PLAIN
   row_group    0       p.holes.array.array.x   compressed_size 38
   row_group    0       p.holes.array.array.x   uncompressed_size       38
   row_group    0       p.holes.array.array.y   type    DOUBLE
   row_group    0       p.holes.array.array.y   num_values      1509
   row_group    0       p.holes.array.array.y   compression     UNCOMPRESSED
   row_group    0       p.holes.array.array.y   encodings       RLE,PLAIN
   row_group    0       p.holes.array.array.y   compressed_size 38
   row_group    0       p.holes.array.array.y   uncompressed_size       38
   row_group    0       user_data       type    BYTE_ARRAY
   row_group    0       user_data       num_values      1509
   row_group    0       user_data       compression     UNCOMPRESSED
   row_group    0       user_data       encodings       PLAIN_DICTIONARY,BIT_PACKED
   row_group    0       user_data       compressed_size 66
   row_group    0       user_data       uncompressed_size       66
   row_group    0       user_data       stats:min       testAttribute123
   row_group    0       user_data       stats:max       testAttribute123
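
To illustrate why the `stats:min`/`stats:max` values above matter for Phase 2, here is a minimal sketch of how a reader could use them to skip a row group for a bounding-box query. `RowGroupPruning`, `ColumnStats`, and `mightIntersect` are hypothetical names for this illustration only; they are not part of Sedona or parquet-mr, and the actual Phase 2 design may differ.

```java
// Hypothetical sketch: prune a row group using the min/max of its x and y columns.
final class RowGroupPruning {

    /** Min/max of one numeric column, as reported in the Parquet footer. */
    static final class ColumnStats {
        final double min;
        final double max;

        ColumnStats(double min, double max) {
            this.min = min;
            this.max = max;
        }

        /** True if [min, max] overlaps the closed interval [lo, hi]. */
        boolean overlaps(double lo, double hi) {
            return min <= hi && max >= lo;
        }
    }

    /**
     * Returns false only when the statistics prove that no coordinate in the
     * row group can fall inside the query box, so the row group can be skipped.
     * A true result means the row group still has to be read and checked.
     */
    static boolean mightIntersect(ColumnStats x, ColumnStats y,
                                  double qMinX, double qMinY,
                                  double qMaxX, double qMaxY) {
        return x.overlaps(qMinX, qMaxX) && y.overlaps(qMinY, qMaxY);
    }

    public static void main(String[] args) {
        // Values copied from the footer dump above (p.ex.array.x / p.ex.array.y).
        ColumnStats x = new ColumnStats(-158.104182, -66.03575);
        ColumnStats y = new ColumnStats(17.986328, 47.777622);

        // Query box inside the x/y ranges: the row group must be read.
        System.out.println(mightIntersect(x, y, -100, 30, -90, 40)); // true

        // Query box with x in [0, 10]: outside the x range, so it can be skipped.
        System.out.println(mightIntersect(x, y, 0, 40, 10, 50));     // false
    }
}
```

This kind of check is only possible when the coordinates are stored as separate numeric columns, as in the footer above, so that Parquet records a min and max for each of them.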
   
   



