[ 
https://issues.apache.org/jira/browse/PARQUET-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334441#comment-15334441
 ] 

Wes McKinney commented on PARQUET-474:
--------------------------------------

Took a look at this. Reading a range of bytes from a local file must be made an 
atomic operation so that we can lock the source while performing a seek then 
read. Currently we have code like

{code}
  source_->Seek(filesize - FOOTER_SIZE);
  int64_t bytes_read = source_->Read(FOOTER_SIZE, footer_buffer);
{code}

This can be made thread-safe by having an API like

{code}
source_->ReadAt(filesize - FOOTER_SIZE, FOOTER_SIZE, footer_buffer);
{code}

We already have

{code}
  std::shared_ptr<Buffer> ReadAt(int64_t pos, int64_t nbytes);
{code}

in which we can also block other threads (as needed) while performing the read.
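
As a rough illustration of the idea (not the parquet-cpp implementation), a positional read can take a mutex around the seek and read so that no other thread can move the file position in between. The class name {{LockedFileSource}}, the {{is_open()}} helper, and the use of C stdio are assumptions made for this sketch only:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <string>

// Hypothetical sketch: a file source whose positional read is atomic with
// respect to other threads sharing the same object.
class LockedFileSource {
 public:
  explicit LockedFileSource(const std::string& path)
      : file_(std::fopen(path.c_str(), "rb")) {}

  ~LockedFileSource() {
    if (file_ != nullptr) std::fclose(file_);
  }

  LockedFileSource(const LockedFileSource&) = delete;
  LockedFileSource& operator=(const LockedFileSource&) = delete;

  bool is_open() const { return file_ != nullptr; }

  // Seeks to pos and reads up to nbytes into out while holding the lock.
  // Returns the number of bytes read, or -1 if the seek fails.
  int64_t ReadAt(int64_t pos, int64_t nbytes, uint8_t* out) {
    std::lock_guard<std::mutex> guard(lock_);
    if (file_ == nullptr || pos < 0 || nbytes < 0) return -1;
    if (std::fseek(file_, static_cast<long>(pos), SEEK_SET) != 0) return -1;
    const std::size_t got =
        std::fread(out, 1, static_cast<std::size_t>(nbytes), file_);
    return static_cast<int64_t>(got);
  }

 private:
  std::FILE* file_;
  std::mutex lock_;  // serializes seek + read pairs
};
```

With something along these lines, the footer read becomes a single call such as {{source.ReadAt(filesize - FOOTER_SIZE, FOOTER_SIZE, footer_buffer)}}, and concurrent callers simply wait their turn on the lock.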

The stream classes are a slightly different rabbit hole -- to get the best 
performance we'd want to have a buffered stream that continues to buffer data 
from the source in a background thread (presently it is synchronous / on-demand 
buffering).
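
To make the background-buffering idea concrete, here is a minimal read-ahead sketch, again not the parquet-cpp stream classes. It assumes a hypothetical {{ReadAtFn}} callback that is itself thread-safe and returns a full chunk except at the end of the source; {{ReadAheadStream}} and its members are invented names. While the caller consumes one chunk, the next one is fetched with {{std::async}}:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <future>
#include <utility>
#include <vector>

// Hypothetical callback: read up to nbytes at pos into out, return bytes read
// (0 at end of source). Assumed to be safe to call from another thread.
using ReadAtFn = std::function<int64_t(int64_t, int64_t, uint8_t*)>;

class ReadAheadStream {
 public:
  ReadAheadStream(ReadAtFn read_at, int64_t chunk_size)
      : read_at_(std::move(read_at)), chunk_size_(chunk_size) {
    StartFetch();  // begin buffering before the first Read call
  }

  // Copies up to nbytes into out; returns fewer only at end of source.
  int64_t Read(int64_t nbytes, uint8_t* out) {
    int64_t copied = 0;
    while (copied < nbytes) {
      if (offset_ == current_.size()) {
        if (eof_) break;
        current_ = pending_.get();  // wait for the background read
        offset_ = 0;
        if (current_.empty()) {
          eof_ = true;
          break;
        }
        StartFetch();  // fetch the next chunk while this one is consumed
      }
      const int64_t avail = static_cast<int64_t>(current_.size() - offset_);
      const int64_t n = std::min<int64_t>(nbytes - copied, avail);
      std::memcpy(out + copied, current_.data() + offset_,
                  static_cast<std::size_t>(n));
      offset_ += static_cast<std::size_t>(n);
      copied += n;
    }
    return copied;
  }

 private:
  void StartFetch() {
    ReadAtFn fn = read_at_;  // copied so the task does not touch *this
    const int64_t pos = next_pos_;
    const int64_t n = chunk_size_;
    next_pos_ += n;
    pending_ = std::async(std::launch::async, [fn, pos, n]() {
      std::vector<uint8_t> chunk(static_cast<std::size_t>(n));
      const int64_t got = fn(pos, n, chunk.data());
      chunk.resize(got > 0 ? static_cast<std::size_t>(got) : 0);
      return chunk;
    });
  }

  ReadAtFn read_at_;
  int64_t chunk_size_;
  int64_t next_pos_ = 0;
  std::vector<uint8_t> current_;
  std::size_t offset_ = 0;
  bool eof_ = false;
  std::future<std::vector<uint8_t>> pending_;  // next chunk, read in background
};
```

This keeps only one chunk in flight; a real implementation would probably want a deeper queue and error propagation, but the sketch shows where the synchronous refill would be replaced.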

> InputStream and RandomAccessSource classes are not threadsafe
> -------------------------------------------------------------
>
>                 Key: PARQUET-474
>                 URL: https://issues.apache.org/jira/browse/PARQUET-474
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>
> We need to ensure that files can be processed in multithreaded applications.



