[ 
https://issues.apache.org/jira/browse/PARQUET-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15750507#comment-15750507
 ] 

William Forson commented on PARQUET-799:
----------------------------------------

Ok, so the problem is that {{parquet::MemoryMapSource::CloseFile}} is getting 
called via three different paths:

1) {{ParquetFileReader::Close}}, which is invoked explicitly in the body of 
{{~ParquetFileReader}}
2) {{SerializedFile::Close}}, which is invoked explicitly in the body of 
{{~SerializedFile}} (which, in turn, is invoked via the default destructor for 
{{ParquetFileReader}})
3) {{~MemoryMapSource}}, which is invoked via the default destructor of 
{{SerializedFile}}, etc.

In a multi-threaded environment where lots of files are being read in parallel, 
disaster occurs when one thread's {{mmap}} call gets interleaved between a pair 
of another thread's {{munmap}} calls.

>From a glance, I suspect this situation could be fixed completely by adding 
>the following statement right after [this 
>one|https://github.com/apache/parquet-cpp/blob/master/src/parquet/util/input.cc#L137]:

bq. data_ = nullptr;

(also, from another angle, passing {{memory_map=false}} into 
[{{ParquetFileReader::OpenFile}}|https://github.com/apache/parquet-cpp/blob/master/src/parquet/file/reader.h#L83]
 avoids the problem altogether)

I suppose "contributions are welcome" is ultimately a fair response in open 
source, but as I won't have the bandwidth to contribute to this project anytime 
_too_ soon, one request/suggestion: please try to mark any "sample" code in 
your codebase accordingly (as some of it already is, e.g. 
{{ParquetFileReader::DebugPrint}}).

For example, even now that you have told me that the 
{{ParquetFileReader::OpenFile}} function and the 
{{LocalFileSource}}/{{MemoryMapSource}} classes are "sample implementations", I 
can find absolutely nothing in the source that indicates this. Because I 
assumed I was using code that's more or less production-ready, it took me a 
_ton_ longer to find this bug than it otherwise would have.

At any rate, Wes: thank you so much for being responsive on both of my issues 
this week. And Deepak: I really appreciate your apparent willingness to help me 
debug in detail.

> concurrent usage of the file reader API
> ---------------------------------------
>
>                 Key: PARQUET-799
>                 URL: https://issues.apache.org/jira/browse/PARQUET-799
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: William Forson
>
> I've recently been debugging a segfault that occurs when concurrently reading 
> (distinct) parquet files from multiple threads.
> I initially assumed this was a reasonable thing to do, since the project 
> README doesn't say anything about concurrency one way or the other. But then 
> I encountered [this TODO 
> comment|https://github.com/apache/parquet-cpp/blob/master/src/parquet/column/page.h#L35]:
> {quote}
> // TODO: Parallel processing is not yet safe because of memory-ownership
> // semantics (the PageReader may or may not own the memory referenced by a
> // page)
> {quote}
> And it has got me wondering: is parquet-cpp fundamentally NOT thread-safe, 
> even for the use case of reading a single file per thread at any given time? 
> Or is it basically thread-safe with a couple gotchas?
> Also, jfyi, I'm currently running against a build which incorporates [this 
> change|https://github.com/apache/parquet-cpp/commit/002466539f6aba7bf1f885b66f61f302ed88fa6b].
> (aside: my motivation for recently posting an issue re. {{THRIFT_HOME}} was 
> to rule out any ABI weirdness that might result from building parquet-cpp 
> against a different version of thrift than the applications that ultimately 
> consume parquet-cpp)
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to