Hi Li,

Thanks for the explanation! I’ll keep the code as is for now and keep an
eye on ARROW-3283.

As you pointed out, I’ll need another solution for streaming the table over a 
socket anyway.
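
For the archives, this is roughly the direction I have in mind for the
socket case. It is an untested sketch; the class names are from the
org.apache.arrow.vector.ipc package in recent Arrow versions, and the
socket setup is simplified:

import java.io.IOException;
import java.net.Socket;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamReader;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;

// Sender: write each populated batch to the socket's output stream.
void send(VectorSchemaRoot root, Socket socket) throws IOException {
    try (ArrowStreamWriter writer =
            new ArrowStreamWriter(root, null, socket.getOutputStream())) {
        writer.start();
        writer.writeBatch(); // call once per batch after refilling root
        writer.end();
    }
}

// Receiver: batches (and their row counts) arrive one at a time, so the
// total row count is only known after the last batch has been read.
long receive(Socket socket) throws IOException {
    long nRows = 0;
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         ArrowStreamReader reader =
                 new ArrowStreamReader(socket.getInputStream(), allocator)) {
        while (reader.loadNextBatch()) {
            nRows += reader.getVectorSchemaRoot().getRowCount();
        }
    }
    return nRows;
}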

To clarify, my code does read the actual data in a second pass. However, doing 
so without knowing how many rows to expect is very expensive.
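
And for completeness, below is roughly the metadata-only first pass I was
hoping for (what I understand ARROW-3283 to be about). It is untested and
pokes at the IPC message framing (the 4-byte length prefix in front of each
batch's flatbuffer metadata) rather than any public API, so the details may
well be off:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.util.List;
import org.apache.arrow.flatbuf.Message;
import org.apache.arrow.flatbuf.MessageHeader;
import org.apache.arrow.flatbuf.RecordBatch;
import org.apache.arrow.vector.ipc.message.ArrowBlock;

// The footer already gives each batch's offset and metadata length
// (via reader.getRecordBlocks()), so we can seek to each block, read
// just the metadata header, and sum row counts without touching bodies.
long countRows(SeekableByteChannel channel, List<ArrowBlock> blocks)
        throws IOException {
    long nRows = 0;
    for (ArrowBlock block : blocks) {
        ByteBuffer metadata = ByteBuffer.allocate(block.getMetadataLength());
        channel.position(block.getOffset());
        while (metadata.hasRemaining() && channel.read(metadata) >= 0) {
            // keep reading until the metadata buffer is full
        }
        metadata.flip();
        metadata.position(4); // skip the 4-byte message length prefix
        Message message = Message.getRootAsMessage(metadata);
        if (message.headerType() == MessageHeader.RecordBatch) {
            RecordBatch batch = (RecordBatch) message.header(new RecordBatch());
            nRows += batch.length();
        }
    }
    return nRows;
}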

Thanks again,
Michael

> On 21. Sep 2018, at 16:31, Li Jin <ice.xell...@gmail.com> wrote:
> 
> Hi Michael,
> 
> I think ArrowFileReader takes a SeekableByteChannel, so it would be
> possible to read only the metadata for each record batch and skip the
> data. However, that is not implemented.
> 
> If the input Channel is not seekable (for example, a socket channel), then
> you would need to read the body of each record batch to get to the next
> batch, so my hunch is that the performance will be similar whether you read
> the record batch body into a VectorSchemaRoot or just read the bytes.
> 
> If you don't assume your input data is always going to be seekable, I am
> not sure there is a quicker way to do this.
> 
> 
> 
> On Fri, Sep 21, 2018 at 9:33 AM Michael Knopf <mkn...@rapidminer.com> wrote:
> 
>> Hi all,
>> 
>> I am looking for a quick way to look up the total row count of a data set
>> stored in Arrow’s random access file format using the Java API. Basically,
>> a quicker way to do this:
>> 
>> // The reader is an instance of ArrowFileReader; root comes from it
>> VectorSchemaRoot root = reader.getVectorSchemaRoot();
>> List<ArrowBlock> blocks = reader.getRecordBlocks();
>> long nRows = 0;
>> for (ArrowBlock block : blocks) {
>>    reader.loadRecordBatch(block);
>>    nRows += root.getRowCount();
>> }
>> 
>> My understanding is that the above snippet loads the entire data set
>> instead of just the block headers.
>> 
>> To give you some context, I am looking into using Arrow for IPC between a
>> JVM and a Python interpreter, using a custom data format on the JVM side
>> and PyArrow/Pandas on the Python side. While the streaming API might be a
>> better tool for this job, I started out using files to keep things simple.
>> 
>> Any help would be greatly appreciated – maybe I just missed the right bit
>> of documentation.
>> 
>> Thanks,
>> Michael
