It would be nice to have an API to look at the file footer (we don't
have one in C++ either). I opened
https://issues.apache.org/jira/browse/ARROW-3283

On Fri, Sep 21, 2018 at 10:32 AM Li Jin <ice.xell...@gmail.com> wrote:
>
> Hi Michael,
>
> I think ArrowFileReader takes SeekableByteChannel so it's possible to only
> read the metadata for each record batch and skip the data. However it is
> not implemented.
>
> If the input Channel is not seekable (for example, a socket channel) then
> you would need to read the body for each record batch to get the next
> batch, so my hunch is that the performance will be similar whether you read
> record batch body into VectorSchemaRoot or just read the bytes.
>
> If you don't assume your input data is always going to be seekable, I am
> not sure there is a quicker way to do this.
>
>
>
> On Fri, Sep 21, 2018 at 9:33 AM Michael Knopf <mkn...@rapidminer.com> wrote:
>
> > Hi all,
> >
> > I am looking for a quick way to look up the total row count of a data set
> > stored in Arrow’s random access file format using the Java API. Basically,
> > a quicker way to do this:
> >
> > // The reader is an instance of ArrowFileReader
> > List<ArrowBlock> blocks = reader.getRecordBlocks();
> > int nRows = 0;
> > for (ArrowBlock block : blocks) {
> >     reader.loadRecordBatch(block);
> >     nRows += root.getRowCount();
> > }
> >
> > My understanding is that the above snippet loads the entire data set
> > instead of just the block headers.
> >
> > To give you some context, I am looking into using Arrow for IPC between a
> > JVM and a Python interpreter using a custom data format and PyArrow/Pandas
> > respectively. While the streaming API might be a better tool for this job,
> > I started out with using files to keep things simple.
> >
> > Any help would be greatly appreciated – maybe I just missed the right bit
> > of documentation.
> >
> > Thanks,
> > Michael
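
As a rough illustration of the point above about seekable input, here is a minimal sketch in plain Java. It does not use the Arrow API or the Arrow file layout: the block layout (a 4-byte row count, a 4-byte body length, then the body) and the names ToyRowCounter, writeToyFile and countRows are made up for this example. It only shows how a SeekableByteChannel lets a reader pick up a small per-block header and jump over the body without reading it.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ToyRowCounter {

    // Hypothetical toy layout, NOT the Arrow file format:
    // each block is [int rowCount][int bodyLength][bodyLength bytes of body].
    static final int HEADER_BYTES = 8;

    // Writes one block per entry in rowCounts; each row occupies bytesPerRow body bytes.
    static void writeToyFile(Path path, int[] rowCounts, int bytesPerRow) throws IOException {
        try (SeekableByteChannel out = Files.newByteChannel(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int rows : rowCounts) {
                int bodyLength = rows * bytesPerRow;
                ByteBuffer block = ByteBuffer.allocate(HEADER_BYTES + bodyLength);
                block.putInt(rows).putInt(bodyLength);
                block.rewind(); // position 0, limit stays at capacity; body is zero-filled
                while (block.hasRemaining()) {
                    out.write(block);
                }
            }
        }
    }

    // Sums the row counts by reading only the 8-byte headers and seeking past each body.
    static long countRows(SeekableByteChannel in) throws IOException {
        ByteBuffer header = ByteBuffer.allocate(HEADER_BYTES);
        long totalRows = 0;
        long position = 0;
        long size = in.size();
        while (position + HEADER_BYTES <= size) {
            in.position(position);
            header.clear();
            while (header.hasRemaining()) {
                if (in.read(header) < 0) {
                    throw new IOException("Unexpected end of file in block header");
                }
            }
            header.flip();
            int rows = header.getInt();
            int bodyLength = header.getInt();
            totalRows += rows;
            position += HEADER_BYTES + (long) bodyLength; // skip the body entirely
        }
        return totalRows;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("toy-blocks", ".bin");
        try {
            writeToyFile(file, new int[] {3, 0, 5}, 16);
            try (SeekableByteChannel in = Files.newByteChannel(file, StandardOpenOption.READ)) {
                System.out.println("total rows = " + countRows(in));
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

With a non-seekable input such as a socket channel, the in.position(...) jump is not available, so the body bytes would have to be read and discarded to reach the next header, which is why the cost ends up similar to loading the batches.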