[GitHub] [incubator-sedona] swamirishi commented on a change in pull request #536: [SEDONA-36] Parquet reader & Writers

GitBox Tue, 24 Aug 2021 13:38:56 -0700


swamirishi commented on a change in pull request #536:
URL: https://github.com/apache/incubator-sedona/pull/536#discussion_r695185286




##########
File path: core/src/main/java/org/apache/sedona/core/io/parquet/ParquetFileReader.java
##########
@@ -0,0 +1,60 @@
+package org.apache.sedona.core.io.parquet;
+
+import com.google.common.collect.Lists;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.parquet.avro.AvroParquetInputFormat;
+import org.apache.parquet.cli.util.Expressions;
+import org.apache.parquet.cli.util.Schemas;
+import org.apache.parquet.hadoop.ParquetInputFormat;
+import org.apache.sedona.core.constants.SedonaConstants;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.FlatMapFunction;
+import scala.Tuple2;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Iterator;
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+
+public class ParquetFileReader {
+    /**
+     * Reads Parquet File with given geometry Column and the relevant User columns
+     * @param sc
+     * @param geometryColumn
+     * @param userColumns
+     * @param inputPaths
+     * @return Avro Record RDD which needs to be deserialized into a GeometryRDD
+     * @throws IOException
+     */
+    public static JavaRDD<GenericRecord> readFile(JavaSparkContext sc,
+                                                  String geometryColumn,
+                                                  List<String> userColumns,
+                                                  String... inputPaths) throws IOException {
+        final Job job = Job.getInstance(sc.hadoopConfiguration());

Review comment:
       This only imports the Hadoop-based configuration, if there is any, and 
does nothing else. Basically, it picks up all of the configuration that comes 
from the resource paths. If you look into the code of sc.textFile(), it does 
the same thing internally.
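
To illustrate the point above, here is a minimal sketch of the pattern, assuming (as the Hadoop Javadoc describes) that `Job.getInstance(Configuration)` works on a copy of the configuration it is given. It deliberately uses `java.util.Properties` as a stand-in instead of the real Hadoop classes so that it runs without any dependencies. The class name `JobConfigSketch`, the helper `jobConfigFrom`, and the key `example.geometry.column` are hypothetical and not part of this PR.

```java
import java.util.Properties;

public class JobConfigSketch {

    // Hypothetical stand-in for Job.getInstance(sc.hadoopConfiguration()):
    // every entry of the shared configuration is copied into a job-local one.
    public static Properties jobConfigFrom(Properties shared) {
        Properties jobConf = new Properties();
        for (String key : shared.stringPropertyNames()) {
            jobConf.setProperty(key, shared.getProperty(key));
        }
        return jobConf;
    }

    public static void main(String[] args) {
        // Stand-in for the configuration Spark loaded from core-site.xml etc.
        Properties shared = new Properties();
        shared.setProperty("fs.defaultFS", "hdfs://namenode:8020");

        // The job-local copy inherits everything from the shared configuration.
        Properties jobConf = jobConfigFrom(shared);

        // Reader-specific settings go on the job-local copy only.
        jobConf.setProperty("example.geometry.column", "geom");

        System.out.println(jobConf.getProperty("fs.defaultFS"));
        System.out.println(shared.getProperty("example.geometry.column"));
    }
}
```

The first line printed is the inherited `fs.defaultFS` value and the second is `null`: the job sees the cluster settings, while the reader-specific key never reaches the shared configuration.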




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
