caseykneale opened a new issue, #7394:
URL: https://github.com/apache/datafusion/issues/7394

   ### Describe the bug
   
   The data I am using is in a normalized parquet format and each "table" is in 
its own file, each registered to the SessionContext instance, and can be 
queried with a variety of other queries. The problematic query has maybe 4 
`JOIN`s (outer, inner, etc) and it ends with a `NOT EXISTS` clause on a 
subquery with another set of roughly the same 4 `JOIN`s on different 
tables. The outer join occurs inside of a view which is shared by the other 3 
inner joins. The query planning succeeds. The query runs for a while, maybe 15 
minutes, and then appears as though it has completed (CPU cores spin down, RAM 
consumption goes down to baseline). The segfault happens while the DataFrame 
itself is being collected (the call is `await`ed), before any RecordBatches are 
returned from the DataFrame. For what it's worth, the DataFrame should be empty 
at the end of this query, as it's serving as a control for a unit test.
   
   The segfault currently occurs on an Intel Mac. I saw an open issue about 
segfaults in unit tests, 
https://github.com/apache/arrow-datafusion/issues/5693, and don't know whether 
or not this could be the same issue.  
   
   I see a few blocks of unsafe code in the project, most of which look benign, 
but I haven't ruled out a stack overflow scenario. I'm not sure where to start 
poking. I may try adjusting `RUST_MIN_STACK` to see if that helps? Or memoizing 
the subquery results before the `NOT EXISTS` clause?
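
One way to test the stack overflow theory in isolation is to run the suspect work on a thread with an explicitly enlarged stack. The sketch below uses only the standard library; `run_with_stack` and `recurse` are hypothetical names, and `recurse` is just a stand-in for deeply recursive work, not DataFusion code. Note that `RUST_MIN_STACK` only sets the default for threads spawned without an explicit size and does not affect the main thread, so setting the size directly can be a more targeted experiment.

```rust
use std::thread;

/// Run `f` on a freshly spawned thread with an explicit stack size
/// (hypothetical helper, not part of DataFusion).
fn run_with_stack<T, F>(stack_bytes: usize, f: F) -> thread::Result<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .name("big-stack".into())
        .stack_size(stack_bytes) // overrides RUST_MIN_STACK for this thread
        .spawn(f)
        .expect("failed to spawn thread")
        .join() // Err(..) if the closure panicked
}

/// Stand-in for deeply recursive work; each call adds one stack frame.
fn recurse(depth: u64) -> u64 {
    if depth == 0 {
        0
    } else {
        1 + recurse(depth - 1)
    }
}
```

Calling `run_with_stack(64 * 1024 * 1024, || recurse(100_000))` runs the recursion with a 64 MiB stack. If the real query stops crashing when its runtime's worker threads are given a larger stack, that would point towards stack exhaustion rather than the unsafe blocks.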
   
   Any suggestions appreciated.
   
   ### To Reproduce
   
   Unfortunately, I can't share the data or the code needed to reproduce this, 
but something tells me I could make an MRE, as I doubt this behavior is 
exclusive to this type of query.
   
   ### Expected behavior
   
   This may sound terse but I mean this in the most polite way possible. 
Ideally queries do not segfault.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

