OmBiradar commented on PR #50158:
URL: https://github.com/apache/arrow/pull/50158#issuecomment-4827132870

   Hey @wgtmac
   I looked into the failing test, which is specifically:
`TestParquetFileFormatScan.ScanRecordBatchReaderProjectedNested/0Threaded16b1024r`
   This test requires reading a nested Parquet file that has struct columns.
   I used gdb to obtain a backtrace once I was confident the program had hit a 
"deadlock" of some sort.
   Analysing the deadlock, I found that:
   
   1. The threads which are supposed to read the structs hand off the reading 
of their child fields to other threads and go into a "wait" state.
   2. The child field reads cannot execute because the thread pool is fully 
saturated with threads which are in that "wait" state.
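
   The two observations above can be reproduced in a small model. This is a 
minimal Python sketch, not Arrow code: `read_child`, `read_struct_blocking`, 
`run_blocking_demo` and `POOL_SIZE` are hypothetical names, and the timeout 
exists only so the demo terminates. A real blocking wait has no timeout and 
would hang at that point.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

POOL_SIZE = 2  # hypothetical, kept tiny so the pool saturates immediately


def read_child(name):
    """Stand-in for decoding one child field of a struct column."""
    return name + ": decoded"


def read_struct_blocking(pool, name, barrier, wait_seconds):
    """Stand-in for a struct reader that blocks on its child field."""
    barrier.wait()  # every pool worker is now busy running a struct task
    child = pool.submit(read_child, name + ".child")
    try:
        # Without the timeout this call would never return.
        child.result(timeout=wait_seconds)
        status = "child ran"
    except FutureTimeout:
        status = "child starved"
    barrier.wait()  # hold the worker until all parents have stopped waiting
    return status, child


def run_blocking_demo(wait_seconds=0.5):
    barrier = threading.Barrier(POOL_SIZE)
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        parents = [
            pool.submit(read_struct_blocking, pool, f"struct{i}", barrier,
                        wait_seconds)
            for i in range(POOL_SIZE)
        ]
        results = [p.result() for p in parents]
    # Leaving the `with` block waits for the queued child tasks to finish.
    statuses = [status for status, _ in results]
    child_values = [child.result() for _, child in results]
    return statuses, child_values
```

   Every parent reports that its child was starved while it waited, and the 
children only run once the parents give their workers back to the pool.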
   
   This creates the deadlock: the struct-reading threads wait on child reads 
that can never be scheduled, because every thread in the pool is already 
blocked waiting.
   Thus, I believe this is a sync-async problem, where generally a blocking 
thread should not spawn work on other threads and then wait on it. Here an 
async style of thread management would be nice.
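
   As a rough illustration of what I mean by async style, here is a minimal 
Python sketch using `asyncio` on top of a thread pool. It is an analogy only 
and not how Arrow is structured: `read_child`, `read_struct`, `read_all` and 
`run_async_demo` are hypothetical names. The struct-level step suspends on 
the event loop instead of parking a pool thread, so the pool threads only 
ever run child reads.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def read_child(name):
    """Stand-in for decoding one child field of a struct column."""
    return name + ": decoded"


async def read_struct(loop, pool, name, child_names):
    """Fan out the child reads, then suspend instead of blocking a worker."""
    children = [
        loop.run_in_executor(pool, read_child, f"{name}.{child}")
        for child in child_names
    ]
    # `await` parks this coroutine on the event loop; no pool thread waits.
    return await asyncio.gather(*children)


async def read_all(pool_size, n_structs):
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        structs = [
            read_struct(loop, pool, f"struct{i}", ["a", "b"])
            for i in range(n_structs)
        ]
        return await asyncio.gather(*structs)


def run_async_demo(pool_size=2, n_structs=4):
    return asyncio.run(read_all(pool_size, n_structs))
```

   With this shape, more structs than pool threads is not a problem, since 
no pool thread is ever held by a task that is only waiting.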
   There is also a note in the `cpp/src/parquet/arrow/reader.cc` file, in 
80fe83a4fd2cc6d119eaf547cee24a2cdf1d28d8 by lidavidm and pitrou, which says:
   
   > Making the Parquet reader truly asynchronous requires heavy refactoring, 
so the generator piggybacks on ReadRangeCache.
   
   I believe that it enables multi-threaded reading of row groups, but it 
does not account for threads spawning further work on other threads to read 
the individual fields of a struct.
   I don't really have much of an idea of how to approach this. Could you 
please provide any help, @wgtmac?

