davisusanibar commented on code in PR #14382:
URL: https://github.com/apache/arrow/pull/14382#discussion_r993854275
##########
docs/source/java/dataset.rst:
##########
@@ -32,31 +32,50 @@ is not designed only for querying files but can be extended to serve all
possible data sources such as from inter-process communication or from other
network locations, etc.
+.. contents::
+
Getting Started
===============
+Currently supported file formats are:
+
+- Apache Arrow (`.arrow`)
+- Apache ORC (`.orc`)
+- Apache Parquet (`.parquet`)
+- Comma-Separated Values (`.csv`)
+
Below shows a simplest example of using Dataset to query a Parquet file in
Java:
.. code-block:: Java
// read data from file /opt/example.parquet
String uri = "file:/opt/example.parquet";
- BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
- DatasetFactory factory = new FileSystemDatasetFactory(allocator,
- NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
- Dataset dataset = factory.finish();
- Scanner scanner = dataset.newScan(new ScanOptions(100)));
- List<ArrowRecordBatch> batches = StreamSupport.stream(
- scanner.scan().spliterator(), false)
- .flatMap(t -> stream(t.execute()))
- .collect(Collectors.toList());
-
- // do something with read record batches, for example:
- analyzeArrowData(batches);
-
- // finished the analysis of the data, close all resources:
- AutoCloseables.close(batches);
- AutoCloseables.close(factory, dataset, scanner);
+ ScanOptions options = new ScanOptions(/*batchSize*/ 5);
+ try (
+ BufferAllocator allocator = new RootAllocator();
Review Comment:
Changed
##########
docs/source/java/dataset.rst:
##########
@@ -65,6 +84,9 @@ Below shows a simplest example of using Dataset to query a Parquet file in Java:
aware container ``VectorSchemaRoot`` by which user could be able to access
decoded data conveniently in Java.
+ The ``ScanOptions`` `batchSize` argument takes effect only if it is set to a value
Review Comment:
Changed