lidavidm commented on code in PR #258:
URL: https://github.com/apache/arrow-cookbook/pull/258#discussion_r974238525


##########
java/source/dataset.rst:
##########
@@ -213,6 +231,87 @@ Query Data Content For File
    2    Gladis
    3    Juan
 
+Lets try to read a parquet gzip compressed file with 06 row groups:
+
+.. code-block::
+
+   $ parquet-tools meta data4_3rg_gzip.parquet
+
+   file schema: schema
+   age:         OPTIONAL INT64 R:0 D:1
+   name:        OPTIONAL BINARY L:STRING R:0 D:1
+   row group 1: RC:4 TS:182 OFFSET:4
+   row group 2: RC:4 TS:190 OFFSET:420
+   row group 3: RC:3 TS:179 OFFSET:838
+
+In this case, we are configuring ScanOptions batchSize argument equals to 20 rows, it's greater than
+04 rows used on the file, then 04 rows is used on the program execution instead of 20 rows requested.

Review Comment:
   Ditto - I think this is better as API documentation



##########
java/source/demo/pom.xml:
##########
@@ -25,10 +25,16 @@
       </extension>
     </extensions>
     </build>
+    <repositories><!-- This is temporary only for Dataset las version testing purpose -->

Review Comment:
   Can we get this worked out with #253?



##########
java/source/dataset.rst:
##########
@@ -25,6 +25,24 @@ Dataset
 
 .. contents::
 
+Arrow Java Dataset offer native functionalities consuming native artifacts such as:

Review Comment:
   Can we move this to be with the rest of the text above the ToC?



##########
java/source/dataset.rst:
##########
@@ -25,6 +25,24 @@ Dataset
 
 .. contents::
 
+Arrow Java Dataset offer native functionalities consuming native artifacts such as:
+    - JNI Arrow C++ Dataset: libarrow_dataset_jni (dylib/so):
+        To create C++ natively objects Schema, Dataset, Scanner and export that as a references (long id).
+    - JNI Arrow C Data Interface: libarrow_cdata_jni (dylib/so):
+        To get C++ Recordbacth.
+
+Current supported file format Datasets are:
+    - Parquet
+    - Arrow IPC
+    - ORC
+
+Consider that file format input is an URI it means HDFS/S3 are also supported.

Review Comment:
   ```suggestion
   Currently supported file formats are:
   - Apache Arrow (`.arrow`)
   - Apache ORC (`.orc`)
   - Apache Parquet (`.parquet`)
   ```



##########
java/source/dataset.rst:
##########
@@ -317,5 +419,136 @@ In case we need to project only certain columns we could configure ScanOptions w
    Gladis
    Juan
 
+Query IPC File
+==============
+
+Let query information for a IPC file.
+
+Query Data Content For File
+***************************
+
+Reading an IPC file that contains 03 Recordbatch with 03 rows written each one.

Review Comment:
   ```suggestion
   Let's read an Arrow file with 3 record batches, each with 3 rows.
   ```



##########
java/source/dataset.rst:
##########
@@ -317,5 +419,136 @@ In case we need to project only certain columns we could configure ScanOptions w
    Gladis
    Juan
 
+Query IPC File
+==============
+
+Let query information for a IPC file.
+
+Query Data Content For File
+***************************
+
+Reading an IPC file that contains 03 Recordbatch with 03 rows written each one.
+
+In this case, we are configuring ScanOptions batchSize argument equals to 05 rows, it's greater than
+03 rows used on the file, then 03 rows is used on the program execution instead of 05 rows requested.
+

Review Comment:
   ```suggestion
   ```



##########
java/source/dataset.rst:
##########
@@ -317,5 +419,136 @@ In case we need to project only certain columns we could configure ScanOptions w
    Gladis
    Juan
 
+Query IPC File
+==============
+
+Let query information for a IPC file.

Review Comment:
   ```suggestion
   ```



##########
java/source/dataset.rst:
##########
@@ -213,6 +231,87 @@ Query Data Content For File
    2    Gladis
    3    Juan
 
+Lets try to read a parquet gzip compressed file with 06 row groups:
+
+.. code-block::
+
+   $ parquet-tools meta data4_3rg_gzip.parquet
+
+   file schema: schema
+   age:         OPTIONAL INT64 R:0 D:1
+   name:        OPTIONAL BINARY L:STRING R:0 D:1
+   row group 1: RC:4 TS:182 OFFSET:4
+   row group 2: RC:4 TS:190 OFFSET:420
+   row group 3: RC:3 TS:179 OFFSET:838
+
+In this case, we are configuring ScanOptions batchSize argument equals to 20 rows, it's greater than
+04 rows used on the file, then 04 rows is used on the program execution instead of 20 rows requested.
+
+.. testcode::
+
+   import org.apache.arrow.dataset.file.FileFormat;
+   import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+   import org.apache.arrow.dataset.jni.NativeMemoryPool;
+   import org.apache.arrow.dataset.scanner.ScanOptions;
+   import org.apache.arrow.dataset.scanner.Scanner;
+   import org.apache.arrow.dataset.source.Dataset;
+   import org.apache.arrow.dataset.source.DatasetFactory;
+   import org.apache.arrow.memory.BufferAllocator;
+   import org.apache.arrow.memory.RootAllocator;
+   import org.apache.arrow.vector.VectorSchemaRoot;
+   import org.apache.arrow.vector.ipc.ArrowReader;
+
+   import java.io.IOException;
+
+   String uri = "file:" + System.getProperty("user.dir") + "/thirdpartydeps/parquetfiles/data4_3rg_gzip.parquet";
+   ScanOptions options = new ScanOptions(/*batchSize*/ 20);
+   try (
+       BufferAllocator allocator = new RootAllocator();
+       DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
+       Dataset dataset = datasetFactory.finish();
+       Scanner scanner = dataset.newScan(options)
+   ) {
+       scanner.scan().forEach(scanTask -> {
+           try (ArrowReader reader = scanTask.execute()) {
+               int totalBatchSize = 0;
+               final int[] count = {1};

Review Comment:
   Why is count an array?
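   Since `count` is declared inside the lambda body and only mutated in the loop, a plain local `int` should work — the single-element-array trick is only needed when capturing and mutating a variable from the enclosing scope. A rough sketch of what I mean (only the counter changes; everything else mirrors the recipe):

   ```java
   // Sketch: the counter lives entirely inside the lambda, so no
   // effectively-final capture workaround is needed.
   int batchIndex = 1;
   while (reader.loadNextBatch()) {
       try (VectorSchemaRoot root = reader.getVectorSchemaRoot()) {
           System.out.println("Number of rows per batch[" + batchIndex++ + "]: " + root.getRowCount());
       }
   }
   ```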



##########
java/source/dataset.rst:
##########
@@ -317,5 +419,136 @@ In case we need to project only certain columns we could configure ScanOptions w
    Gladis
    Juan
 
+Query IPC File
+==============

Review Comment:
   Ditto below



##########
java/source/dataset.rst:
##########
@@ -317,5 +419,136 @@ In case we need to project only certain columns we could configure ScanOptions w
    Gladis
    Juan
 
+Query IPC File
+==============
+
+Let query information for a IPC file.
+
+Query Data Content For File
+***************************
+
+Reading an IPC file that contains 03 Recordbatch with 03 rows written each one.
+
+In this case, we are configuring ScanOptions batchSize argument equals to 05 rows, it's greater than
+03 rows used on the file, then 03 rows is used on the program execution instead of 05 rows requested.
+
+.. testcode::
+
+   import org.apache.arrow.dataset.file.FileFormat;
+   import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+   import org.apache.arrow.dataset.jni.NativeMemoryPool;
+   import org.apache.arrow.dataset.scanner.ScanOptions;
+   import org.apache.arrow.dataset.scanner.Scanner;
+   import org.apache.arrow.dataset.source.Dataset;
+   import org.apache.arrow.dataset.source.DatasetFactory;
+   import org.apache.arrow.memory.BufferAllocator;
+   import org.apache.arrow.memory.RootAllocator;
+   import org.apache.arrow.vector.VectorSchemaRoot;
+   import org.apache.arrow.vector.ipc.ArrowReader;
+
+   import java.io.IOException;
+
+   String uri = "file:" + System.getProperty("user.dir") + "/thirdpartydeps/arrowfiles/random_access.arrow";
+   ScanOptions options = new ScanOptions(/*batchSize*/ 5);
+   try (
+       BufferAllocator allocator = new RootAllocator();
+       DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.ARROW_IPC, uri);
+       Dataset dataset = datasetFactory.finish();
+       Scanner scanner = dataset.newScan(options)
+   ) {
+       scanner.scan().forEach(scanTask -> {
+           try (ArrowReader reader = scanTask.execute()) {
+               final int[] count = {1};
+               while (reader.loadNextBatch()) {
+                   try (VectorSchemaRoot root = reader.getVectorSchemaRoot()) {
+                       System.out.println("Number of rows per batch["+ count[0]++ +"]: " + root.getRowCount());
+                   }
+               }
+           } catch (IOException e) {
+               e.printStackTrace();
+           }
+       });
+   } catch (Exception e) {
+       e.printStackTrace();
+   }
+
+.. testoutput::
+
+   Number of rows per batch[1]: 3
+   Number of rows per batch[2]: 3
+   Number of rows per batch[3]: 3
+
+Query ORC File
+==============
+
+Let query information for a ORC file.
+
+Query Data Content For File
+***************************
+
+Reading an ORC ZLib compressed file that contains 385 stripe with 5000 rows written each one.

Review Comment:
   ```suggestion
   Let's read an ORC file with zlib compression and 385 stripes, each with 5000 rows.
   ```



##########
java/source/dataset.rst:
##########
@@ -317,5 +419,136 @@ In case we need to project only certain columns we could configure ScanOptions w
    Gladis
    Juan
 
+Query IPC File
+==============
+
+Let query information for a IPC file.
+
+Query Data Content For File
+***************************
+
+Reading an IPC file that contains 03 Recordbatch with 03 rows written each one.
+
+In this case, we are configuring ScanOptions batchSize argument equals to 05 rows, it's greater than
+03 rows used on the file, then 03 rows is used on the program execution instead of 05 rows requested.
+
+.. testcode::
+
+   import org.apache.arrow.dataset.file.FileFormat;
+   import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+   import org.apache.arrow.dataset.jni.NativeMemoryPool;
+   import org.apache.arrow.dataset.scanner.ScanOptions;
+   import org.apache.arrow.dataset.scanner.Scanner;
+   import org.apache.arrow.dataset.source.Dataset;
+   import org.apache.arrow.dataset.source.DatasetFactory;
+   import org.apache.arrow.memory.BufferAllocator;
+   import org.apache.arrow.memory.RootAllocator;
+   import org.apache.arrow.vector.VectorSchemaRoot;
+   import org.apache.arrow.vector.ipc.ArrowReader;
+
+   import java.io.IOException;
+
+   String uri = "file:" + System.getProperty("user.dir") + "/thirdpartydeps/arrowfiles/random_access.arrow";
+   ScanOptions options = new ScanOptions(/*batchSize*/ 5);
+   try (
+       BufferAllocator allocator = new RootAllocator();
+       DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.ARROW_IPC, uri);
+       Dataset dataset = datasetFactory.finish();
+       Scanner scanner = dataset.newScan(options)
+   ) {
+       scanner.scan().forEach(scanTask -> {
+           try (ArrowReader reader = scanTask.execute()) {
+               final int[] count = {1};
+               while (reader.loadNextBatch()) {
+                   try (VectorSchemaRoot root = reader.getVectorSchemaRoot()) {
+                       System.out.println("Number of rows per batch["+ count[0]++ +"]: " + root.getRowCount());
+                   }
+               }
+           } catch (IOException e) {
+               e.printStackTrace();
+           }
+       });
+   } catch (Exception e) {
+       e.printStackTrace();
+   }
+
+.. testoutput::
+
+   Number of rows per batch[1]: 3
+   Number of rows per batch[2]: 3
+   Number of rows per batch[3]: 3
+
+Query ORC File
+==============
+
+Let query information for a ORC file.
+
+Query Data Content For File
+***************************
+
+Reading an ORC ZLib compressed file that contains 385 stripe with 5000 rows written each one.
+
+.. code-block::
+
+   $ orc-metadata demo-11-zlib.orc | more
+
+   { "name": "demo-11-zlib.orc",
+     "type": 
"struct<_col0:int,_col1:string,_col2:string,_col3:string,_col4:int,_col5:string,_col6:int,_col7:int,_col8:int>",
+     "stripe count": 385,
+     "compression": "zlib", "compression block": 262144,
+     "stripes": [
+       { "stripe": 0, "rows": 5000,
+         "offset": 3, "length": 1031,
+         "index": 266, "data": 636, "footer": 129
+       },
+   ...
+
+In this case, we are configuring ScanOptions batchSize argument equals to 4000 rows, it's lower than
+5000 rows used on the file, then 4000 rows is used on the program execution.

Review Comment:
   Again, I think it'd be better to show this as a separate recipe



##########
java/source/dataset.rst:
##########
@@ -317,5 +419,136 @@ In case we need to project only certain columns we could configure ScanOptions w
    Gladis
    Juan
 
+Query IPC File
+==============

Review Comment:
   ```suggestion
   Query Arrow Files
   =================
   ```



##########
java/source/dataset.rst:
##########
@@ -213,6 +231,87 @@ Query Data Content For File
    2    Gladis
    3    Juan
 
+Lets try to read a parquet gzip compressed file with 06 row groups:

Review Comment:
   ```suggestion
   Let's try to read a Parquet file with gzip compression and 6 row groups:
   ```



##########
java/source/dataset.rst:
##########
@@ -25,6 +25,24 @@ Dataset
 
 .. contents::
 
+Arrow Java Dataset offer native functionalities consuming native artifacts such as:
+    - JNI Arrow C++ Dataset: libarrow_dataset_jni (dylib/so):
+        To create C++ natively objects Schema, Dataset, Scanner and export that as a references (long id).
+    - JNI Arrow C Data Interface: libarrow_cdata_jni (dylib/so):
+        To get C++ Recordbacth.

Review Comment:
   I don't think we want to talk about implementation details here



##########
java/source/dataset.rst:
##########
@@ -25,6 +25,24 @@ Dataset
 
 .. contents::
 
+Arrow Java Dataset offer native functionalities consuming native artifacts such as:
+    - JNI Arrow C++ Dataset: libarrow_dataset_jni (dylib/so):
+        To create C++ natively objects Schema, Dataset, Scanner and export that as a references (long id).
+    - JNI Arrow C Data Interface: libarrow_cdata_jni (dylib/so):
+        To get C++ Recordbacth.
+
+Current supported file format Datasets are:
+    - Parquet
+    - Arrow IPC
+    - ORC
+
+Consider that file format input is an URI it means HDFS/S3 are also supported.
+
+.. note::
+
+    The ScanOptions batchSize argument takes effect only if it is set to a value
+    smaller than the number of rows in the recordbatch.

Review Comment:
   I think this is better as API documentation
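   e.g. it could live as javadoc on the `ScanOptions` constructor — a rough sketch of the wording (mine, not the actual javadoc):

   ```java
   // Sketch of how the note could be documented on ScanOptions itself
   // (wording is an assumption, not the library's actual javadoc):
   public class ScanOptions {
       private final long batchSize;

       /**
        * @param batchSize maximum number of rows per returned batch; it only
        *                  takes effect when it is smaller than the number of
        *                  rows in the source record batch, which is otherwise
        *                  returned as-is
        */
       public ScanOptions(long batchSize) {
           this.batchSize = batchSize;
       }
   }
   ```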



##########
java/source/dataset.rst:
##########
@@ -25,6 +25,24 @@ Dataset
 
 .. contents::
 
+Arrow Java Dataset offer native functionalities consuming native artifacts such as:
+    - JNI Arrow C++ Dataset: libarrow_dataset_jni (dylib/so):
+        To create C++ natively objects Schema, Dataset, Scanner and export that as a references (long id).
+    - JNI Arrow C Data Interface: libarrow_cdata_jni (dylib/so):
+        To get C++ Recordbacth.
+
+Current supported file format Datasets are:
+    - Parquet
+    - Arrow IPC
+    - ORC
+
+Consider that file format input is an URI it means HDFS/S3 are also supported.

Review Comment:
   This is best demonstrated as a recipe if possible
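   e.g. something along these lines — assuming the bundled native library was built with S3 filesystem support; bucket and key here are hypothetical placeholders:

   ```java
   import org.apache.arrow.dataset.file.FileFormat;
   import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
   import org.apache.arrow.dataset.jni.NativeMemoryPool;
   import org.apache.arrow.dataset.scanner.ScanOptions;
   import org.apache.arrow.dataset.scanner.Scanner;
   import org.apache.arrow.dataset.source.Dataset;
   import org.apache.arrow.dataset.source.DatasetFactory;
   import org.apache.arrow.memory.BufferAllocator;
   import org.apache.arrow.memory.RootAllocator;

   // Same FileSystemDatasetFactory call as the local-file recipes, just with
   // an S3 URI. "my-bucket" and the key are placeholders, not real test data.
   String uri = "s3://my-bucket/path/to/data.parquet";
   ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
   try (BufferAllocator allocator = new RootAllocator();
        DatasetFactory datasetFactory = new FileSystemDatasetFactory(
            allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
        Dataset dataset = datasetFactory.finish();
        Scanner scanner = dataset.newScan(options)) {
       // scan exactly as in the recipes in this file
   }
   ```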



##########
java/source/dataset.rst:
##########
@@ -317,5 +419,136 @@ In case we need to project only certain columns we could configure ScanOptions w
    Gladis
    Juan
 
+Query IPC File
+==============
+
+Let query information for a IPC file.
+
+Query Data Content For File
+***************************
+
+Reading an IPC file that contains 03 Recordbatch with 03 rows written each one.
+
+In this case, we are configuring ScanOptions batchSize argument equals to 05 rows, it's greater than
+03 rows used on the file, then 03 rows is used on the program execution instead of 05 rows requested.
+
+.. testcode::
+
+   import org.apache.arrow.dataset.file.FileFormat;
+   import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+   import org.apache.arrow.dataset.jni.NativeMemoryPool;
+   import org.apache.arrow.dataset.scanner.ScanOptions;
+   import org.apache.arrow.dataset.scanner.Scanner;
+   import org.apache.arrow.dataset.source.Dataset;
+   import org.apache.arrow.dataset.source.DatasetFactory;
+   import org.apache.arrow.memory.BufferAllocator;
+   import org.apache.arrow.memory.RootAllocator;
+   import org.apache.arrow.vector.VectorSchemaRoot;
+   import org.apache.arrow.vector.ipc.ArrowReader;
+
+   import java.io.IOException;
+
+   String uri = "file:" + System.getProperty("user.dir") + "/thirdpartydeps/arrowfiles/random_access.arrow";
+   ScanOptions options = new ScanOptions(/*batchSize*/ 5);

Review Comment:
   Honestly it would feel more natural to just put a reasonable value like ~32768 here
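   e.g.:

   ```suggestion
      ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
   ```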



##########
java/source/dataset.rst:
##########
@@ -317,5 +419,136 @@ In case we need to project only certain columns we could configure ScanOptions w
    Gladis
    Juan
 
+Query IPC File
+==============
+
+Let query information for a IPC file.
+
+Query Data Content For File
+***************************
+
+Reading an IPC file that contains 03 Recordbatch with 03 rows written each one.
+
+In this case, we are configuring ScanOptions batchSize argument equals to 05 rows, it's greater than
+03 rows used on the file, then 03 rows is used on the program execution instead of 05 rows requested.
+
+.. testcode::
+
+   import org.apache.arrow.dataset.file.FileFormat;
+   import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+   import org.apache.arrow.dataset.jni.NativeMemoryPool;
+   import org.apache.arrow.dataset.scanner.ScanOptions;
+   import org.apache.arrow.dataset.scanner.Scanner;
+   import org.apache.arrow.dataset.source.Dataset;
+   import org.apache.arrow.dataset.source.DatasetFactory;
+   import org.apache.arrow.memory.BufferAllocator;
+   import org.apache.arrow.memory.RootAllocator;
+   import org.apache.arrow.vector.VectorSchemaRoot;
+   import org.apache.arrow.vector.ipc.ArrowReader;
+
+   import java.io.IOException;
+
+   String uri = "file:" + System.getProperty("user.dir") + "/thirdpartydeps/arrowfiles/random_access.arrow";
+   ScanOptions options = new ScanOptions(/*batchSize*/ 5);
+   try (
+       BufferAllocator allocator = new RootAllocator();
+       DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.ARROW_IPC, uri);
+       Dataset dataset = datasetFactory.finish();
+       Scanner scanner = dataset.newScan(options)
+   ) {
+       scanner.scan().forEach(scanTask -> {
+           try (ArrowReader reader = scanTask.execute()) {
+               final int[] count = {1};

Review Comment:
   Same here - why is count an array?



##########
java/source/dataset.rst:
##########
@@ -317,5 +419,136 @@ In case we need to project only certain columns we could configure ScanOptions w
    Gladis
    Juan
 
+Query IPC File
+==============
+
+Let query information for a IPC file.
+
+Query Data Content For File
+***************************
+
+Reading an IPC file that contains 03 Recordbatch with 03 rows written each one.
+
+In this case, we are configuring ScanOptions batchSize argument equals to 05 rows, it's greater than
+03 rows used on the file, then 03 rows is used on the program execution instead of 05 rows requested.
+
+.. testcode::
+
+   import org.apache.arrow.dataset.file.FileFormat;
+   import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+   import org.apache.arrow.dataset.jni.NativeMemoryPool;
+   import org.apache.arrow.dataset.scanner.ScanOptions;
+   import org.apache.arrow.dataset.scanner.Scanner;
+   import org.apache.arrow.dataset.source.Dataset;
+   import org.apache.arrow.dataset.source.DatasetFactory;
+   import org.apache.arrow.memory.BufferAllocator;
+   import org.apache.arrow.memory.RootAllocator;
+   import org.apache.arrow.vector.VectorSchemaRoot;
+   import org.apache.arrow.vector.ipc.ArrowReader;
+
+   import java.io.IOException;
+
+   String uri = "file:" + System.getProperty("user.dir") + "/thirdpartydeps/arrowfiles/random_access.arrow";
+   ScanOptions options = new ScanOptions(/*batchSize*/ 5);
+   try (
+       BufferAllocator allocator = new RootAllocator();
+       DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.ARROW_IPC, uri);
+       Dataset dataset = datasetFactory.finish();
+       Scanner scanner = dataset.newScan(options)
+   ) {
+       scanner.scan().forEach(scanTask -> {
+           try (ArrowReader reader = scanTask.execute()) {
+               final int[] count = {1};
+               while (reader.loadNextBatch()) {
+                   try (VectorSchemaRoot root = reader.getVectorSchemaRoot()) {
+                       System.out.println("Number of rows per batch["+ count[0]++ +"]: " + root.getRowCount());
+                   }
+               }
+           } catch (IOException e) {
+               e.printStackTrace();
+           }
+       });
+   } catch (Exception e) {
+       e.printStackTrace();
+   }
+
+.. testoutput::
+
+   Number of rows per batch[1]: 3
+   Number of rows per batch[2]: 3
+   Number of rows per batch[3]: 3
+
+Query ORC File
+==============
+
+Let query information for a ORC file.

Review Comment:
   ```suggestion
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
