davisusanibar commented on code in PR #258:
URL: https://github.com/apache/arrow-cookbook/pull/258#discussion_r995107135
##########
java/source/dataset.rst:
##########
@@ -213,6 +231,87 @@ Query Data Content For File
2 Gladis
3 Juan
+Let's try to read a gzip-compressed Parquet file with 3 row groups:
+
+.. code-block::
+
+ $ parquet-tools meta data4_3rg_gzip.parquet
+
+ file schema: schema
+ age: OPTIONAL INT64 R:0 D:1
+ name: OPTIONAL BINARY L:STRING R:0 D:1
+ row group 1: RC:4 TS:182 OFFSET:4
+ row group 2: RC:4 TS:190 OFFSET:420
+ row group 3: RC:3 TS:179 OFFSET:838
+
+In this case, we configure the ScanOptions batchSize argument to 20 rows, which
is greater than
+the 4 rows per row group used in the file, so batches of 4 rows are produced
during program execution instead of the 20 rows requested.
+
+.. testcode::
+
+ import org.apache.arrow.dataset.file.FileFormat;
+ import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+ import org.apache.arrow.dataset.jni.NativeMemoryPool;
+ import org.apache.arrow.dataset.scanner.ScanOptions;
+ import org.apache.arrow.dataset.scanner.Scanner;
+ import org.apache.arrow.dataset.source.Dataset;
+ import org.apache.arrow.dataset.source.DatasetFactory;
+ import org.apache.arrow.memory.BufferAllocator;
+ import org.apache.arrow.memory.RootAllocator;
+ import org.apache.arrow.vector.VectorSchemaRoot;
+ import org.apache.arrow.vector.ipc.ArrowReader;
+
+ import java.io.IOException;
+
+ String uri = "file:" + System.getProperty("user.dir") +
"/thirdpartydeps/parquetfiles/data4_3rg_gzip.parquet";
+ ScanOptions options = new ScanOptions(/*batchSize*/ 20);
+ try (
+ BufferAllocator allocator = new RootAllocator();
+ DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator,
NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
+ Dataset dataset = datasetFactory.finish();
+ Scanner scanner = dataset.newScan(options)
+ ) {
+ scanner.scan().forEach(scanTask -> {
+ try (ArrowReader reader = scanTask.execute()) {
+ int totalBatchSize = 0;
+ final int[] count = {1};
Review Comment:
Changed
##########
java/source/dataset.rst:
##########
@@ -317,5 +419,136 @@ In case we need to project only certain columns we could
configure ScanOptions w
Gladis
Juan
+Query IPC File
+==============
+
+Let's query information for an IPC file.
+
+Query Data Content For File
+***************************
+
+Reading an IPC file that contains 3 record batches, each with 3 rows.
+
+In this case, we configure the ScanOptions batchSize argument to 5 rows, which
is greater than
+the 3 rows per record batch in the file, so batches of 3 rows are produced
during program execution instead of the 5 rows requested.
+
+.. testcode::
+
+ import org.apache.arrow.dataset.file.FileFormat;
+ import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+ import org.apache.arrow.dataset.jni.NativeMemoryPool;
+ import org.apache.arrow.dataset.scanner.ScanOptions;
+ import org.apache.arrow.dataset.scanner.Scanner;
+ import org.apache.arrow.dataset.source.Dataset;
+ import org.apache.arrow.dataset.source.DatasetFactory;
+ import org.apache.arrow.memory.BufferAllocator;
+ import org.apache.arrow.memory.RootAllocator;
+ import org.apache.arrow.vector.VectorSchemaRoot;
+ import org.apache.arrow.vector.ipc.ArrowReader;
+
+ import java.io.IOException;
+
+ String uri = "file:" + System.getProperty("user.dir") +
"/thirdpartydeps/arrowfiles/random_access.arrow";
+ ScanOptions options = new ScanOptions(/*batchSize*/ 5);
Review Comment:
Changed
##########
java/source/dataset.rst:
##########
@@ -317,5 +419,136 @@ In case we need to project only certain columns we could
configure ScanOptions w
Gladis
Juan
+Query IPC File
+==============
+
+Let's query information for an IPC file.
+
+Query Data Content For File
+***************************
+
+Reading an IPC file that contains 3 record batches, each with 3 rows.
+
+In this case, we configure the ScanOptions batchSize argument to 5 rows, which
is greater than
+the 3 rows per record batch in the file, so batches of 3 rows are produced
during program execution instead of the 5 rows requested.
+
+.. testcode::
+
+ import org.apache.arrow.dataset.file.FileFormat;
+ import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+ import org.apache.arrow.dataset.jni.NativeMemoryPool;
+ import org.apache.arrow.dataset.scanner.ScanOptions;
+ import org.apache.arrow.dataset.scanner.Scanner;
+ import org.apache.arrow.dataset.source.Dataset;
+ import org.apache.arrow.dataset.source.DatasetFactory;
+ import org.apache.arrow.memory.BufferAllocator;
+ import org.apache.arrow.memory.RootAllocator;
+ import org.apache.arrow.vector.VectorSchemaRoot;
+ import org.apache.arrow.vector.ipc.ArrowReader;
+
+ import java.io.IOException;
+
+ String uri = "file:" + System.getProperty("user.dir") +
"/thirdpartydeps/arrowfiles/random_access.arrow";
+ ScanOptions options = new ScanOptions(/*batchSize*/ 5);
+ try (
+ BufferAllocator allocator = new RootAllocator();
+ DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator,
NativeMemoryPool.getDefault(), FileFormat.ARROW_IPC, uri);
+ Dataset dataset = datasetFactory.finish();
+ Scanner scanner = dataset.newScan(options)
+ ) {
+ scanner.scan().forEach(scanTask -> {
+ try (ArrowReader reader = scanTask.execute()) {
+ final int[] count = {1};
Review Comment:
Changed
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]