davisusanibar commented on code in PR #14382:
URL: https://github.com/apache/arrow/pull/14382#discussion_r993556131
##########
docs/source/java/dataset.rst:
##########
@@ -32,31 +32,49 @@ is not designed only for querying files but can be extended to serve all
possible data sources such as from inter-process communication or from other
network locations, etc.
+.. contents::
+
Getting Started
===============
+Currently supported file formats are:
+
+- Apache Arrow (`.arrow`)
+- Apache ORC (`.orc`)
+- Apache Parquet (`.parquet`)
+- Comma-Separated Values (`.csv`)
+
Below shows a simplest example of using Dataset to query a Parquet file in
Java:
.. code-block:: Java
// read data from file /opt/example.parquet
String uri = "file:/opt/example.parquet";
- BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
- DatasetFactory factory = new FileSystemDatasetFactory(allocator,
- NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
- Dataset dataset = factory.finish();
- Scanner scanner = dataset.newScan(new ScanOptions(100)));
- List<ArrowRecordBatch> batches = StreamSupport.stream(
- scanner.scan().spliterator(), false)
- .flatMap(t -> stream(t.execute()))
- .collect(Collectors.toList());
-
- // do something with read record batches, for example:
- analyzeArrowData(batches);
-
- // finished the analysis of the data, close all resources:
- AutoCloseables.close(batches);
- AutoCloseables.close(factory, dataset, scanner);
+ try (
+ BufferAllocator allocator = new RootAllocator();
+ DatasetFactory datasetFactory = new FileSystemDatasetFactory(
+ allocator, NativeMemoryPool.getDefault(),
+ FileFormat.PARQUET, uri);
+ Dataset dataset = datasetFactory.finish();
+ Scanner scanner = dataset.newScan(options);
Review Comment:
Thanks for catching that. I just added the `options` part.