[ 
https://issues.apache.org/jira/browse/ARROW-12066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308073#comment-17308073
 ] 

Thomas Blauth commented on ARROW-12066:
---------------------------------------

[~jorisvandenbossche] Thank you very much for your answer and the advice!
However, when I try {{filter = pa.dataset.field("A").is_null()}} with 3.0.0, I get
{code:python}
AttributeError: 'pyarrow._dataset.Expression' object has no attribute 'is_null'
{code}
Was {{.is_null()}} perhaps also added in a later release?

> [Python] Dataset API seg fault when filtering string column for None
> --------------------------------------------------------------------
>
>                 Key: ARROW-12066
>                 URL: https://issues.apache.org/jira/browse/ARROW-12066
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 3.0.0
>         Environment: macOS 10.15.7
>            Reporter: Thomas Blauth
>            Priority: Major
>
> Trying to load a parquet file using the dataset api leads to a segmentation 
> fault when filtering string columns for None values.
> Minimal reproducing example: 
> {code:python}
> import pyarrow as pa
> import pyarrow.dataset
> import pyarrow.parquet
> import pandas as pd
> path = "./test.parquet"
> df = pd.DataFrame({"A": ("a", "b", None)})
> pa.parquet.write_table(pa.table(df), path)
> ds = pa.dataset.dataset(path, format="parquet")
> filter = pa.dataset.field("A") == pa.dataset.scalar(None)
> table = ds.to_table(filter=filter)
> {code}
> Backtrace:
> {code:bash}
> (lldb) target create "/usr/local/mambaforge/envs/xxx/bin/python"
> Current executable set to '/usr/local/mambaforge/envs/xxx/bin/python' 
> (x86_64).
> (lldb) settings set -- target.run-args  "./tmp.py"
> (lldb) r
> Process 35235 launched: '/usr/local/mambaforge/envs/xxx/bin/python' (x86_64)
> Process 35235 stopped
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x9)
>     frame #0: 0x000000010314be48 libarrow.300.0.0.dylib`arrow::Status 
> arrow::VisitScalarInline<arrow::ScalarHashImpl>(arrow::Scalar const&, 
> arrow::ScalarHashImpl*) + 104
> libarrow.300.0.0.dylib`arrow::VisitScalarInline<arrow::ScalarHashImpl>:
> ->  0x10314be48 <+104>: cmpb   $0x0, 0x9(%rax)
>     0x10314be4c <+108>: je     0x10314c0bc               ; <+732>
>     0x10314be52 <+114>: movq   0x10(%rax), %rdi
>     0x10314be56 <+118>: movq   0x20(%rax), %rsi
> Target 0: (python) stopped.
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x9)
>   * frame #0: 0x000000010314be48 libarrow.300.0.0.dylib`arrow::Status 
> arrow::VisitScalarInline<arrow::ScalarHashImpl>(arrow::Scalar const&, 
> arrow::ScalarHashImpl*) + 104
>     frame #1: 0x000000010314bd4f 
> libarrow.300.0.0.dylib`arrow::ScalarHashImpl::AccumulateHashFrom(arrow::Scalar
>  const&) + 111
>     frame #2: 0x0000000103134bca 
> libarrow.300.0.0.dylib`arrow::Scalar::Hash::hash(arrow::Scalar const&) + 42
>     frame #3: 0x0000000132fa0ea8 
> libarrow_dataset.300.0.0.dylib`arrow::dataset::Expression::hash() const + 264
>     frame #4: 0x0000000132fc913c 
> libarrow_dataset.300.0.0.dylib`std::__1::__hash_const_iterator<std::__1::__hash_node<arrow::dataset::Expression,
>  void*>*> std::__1::__hash_table<arrow::dataset::Expression, 
> arrow::dataset::Expression::Hash, 
> std::__1::equal_to<arrow::dataset::Expression>, 
> std::__1::allocator<arrow::dataset::Expression> 
> >::find<arrow::dataset::Expression>(arrow::dataset::Expression const&) const 
> + 28
>     frame #5: 0x0000000132faca9b 
> libarrow_dataset.300.0.0.dylib`arrow::Result<arrow::dataset::Expression> 
> arrow::dataset::Modify<arrow::dataset::Canonicalize(arrow::dataset::Expression,
>  arrow::compute::ExecContext*)::$_1, 
> arrow::dataset::Canonicalize(arrow::dataset::Expression, 
> arrow::compute::ExecContext*)::$_9>(arrow::dataset::Expression, 
> arrow::dataset::Canonicalize(arrow::dataset::Expression, 
> arrow::compute::ExecContext*)::$_1 const&, 
> arrow::dataset::Canonicalize(arrow::dataset::Expression, 
> arrow::compute::ExecContext*)::$_9 const&) + 123
>     frame #6: 0x0000000132fac623 
> libarrow_dataset.300.0.0.dylib`arrow::dataset::Canonicalize(arrow::dataset::Expression,
>  arrow::compute::ExecContext*) + 131
>     frame #7: 0x0000000132fac76d 
> libarrow_dataset.300.0.0.dylib`arrow::dataset::Canonicalize(arrow::dataset::Expression,
>  arrow::compute::ExecContext*) + 461
>     frame #8: 0x0000000132fb00cb 
> libarrow_dataset.300.0.0.dylib`arrow::dataset::SimplifyWithGuarantee(arrow::dataset::Expression,
>  arrow::dataset::Expression const&)::$_10::operator()() const + 75
>     frame #9: 0x0000000132faf6b5 
> libarrow_dataset.300.0.0.dylib`arrow::dataset::SimplifyWithGuarantee(arrow::dataset::Expression,
>  arrow::dataset::Expression const&) + 517
>     frame #10: 0x0000000132f893f8 
> libarrow_dataset.300.0.0.dylib`arrow::dataset::Dataset::GetFragments(arrow::dataset::Expression)
>  + 88
>     frame #11: 0x0000000132f8d25c 
> libarrow_dataset.300.0.0.dylib`arrow::dataset::GetFragmentsFromDatasets(std::__1::vector<std::__1::shared_ptr<arrow::dataset::Dataset>,
>  std::__1::allocator<std::__1::shared_ptr<arrow::dataset::Dataset> > > 
> const&, 
> arrow::dataset::Expression)::'lambda'(std::__1::shared_ptr<arrow::dataset::Dataset>)::operator()(std::__1::shared_ptr<arrow::dataset::Dataset>)
>  const + 76
>     frame #12: 0x0000000132f8cd6c 
> libarrow_dataset.300.0.0.dylib`arrow::MapIterator<arrow::dataset::GetFragmentsFromDatasets(std::__1::vector<std::__1::shared_ptr<arrow::dataset::Dataset>,
>  std::__1::allocator<std::__1::shared_ptr<arrow::dataset::Dataset> > > 
> const&, 
> arrow::dataset::Expression)::'lambda'(std::__1::shared_ptr<arrow::dataset::Dataset>),
>  std::__1::shared_ptr<arrow::dataset::Dataset>, 
> arrow::Iterator<std::__1::shared_ptr<arrow::dataset::Fragment> > >::Next() + 
> 316
>     frame #13: 0x0000000132f8cb27 
> libarrow_dataset.300.0.0.dylib`arrow::Result<arrow::Iterator<std::__1::shared_ptr<arrow::dataset::Fragment>
>  > > 
> arrow::Iterator<arrow::Iterator<std::__1::shared_ptr<arrow::dataset::Fragment>
>  > 
> >::Next<arrow::MapIterator<arrow::dataset::GetFragmentsFromDatasets(std::__1::vector<std::__1::shared_ptr<arrow::dataset::Dataset>,
>  std::__1::allocator<std::__1::shared_ptr<arrow::dataset::Dataset> > > 
> const&, 
> arrow::dataset::Expression)::'lambda'(std::__1::shared_ptr<arrow::dataset::Dataset>),
>  std::__1::shared_ptr<arrow::dataset::Dataset>, 
> arrow::Iterator<std::__1::shared_ptr<arrow::dataset::Fragment> > > >(void*) + 
> 39
>     frame #14: 0x0000000132f8dcdb 
> libarrow_dataset.300.0.0.dylib`arrow::Iterator<arrow::Iterator<std::__1::shared_ptr<arrow::dataset::Fragment>
>  > >::Next() + 43
>     frame #15: 0x0000000132f8d692 
> libarrow_dataset.300.0.0.dylib`arrow::FlattenIterator<std::__1::shared_ptr<arrow::dataset::Fragment>
>  >::Next() + 258
>     frame #16: 0x0000000132f8d477 
> libarrow_dataset.300.0.0.dylib`arrow::Result<std::__1::shared_ptr<arrow::dataset::Fragment>
>  > arrow::Iterator<std::__1::shared_ptr<arrow::dataset::Fragment> 
> >::Next<arrow::FlattenIterator<std::__1::shared_ptr<arrow::dataset::Fragment> 
> > >(void*) + 39
>     frame #17: 0x0000000132f8de0b 
> libarrow_dataset.300.0.0.dylib`arrow::Iterator<std::__1::shared_ptr<arrow::dataset::Fragment>
>  >::Next() + 43
>     frame #18: 0x0000000132fffe80 
> libarrow_dataset.300.0.0.dylib`arrow::MapIterator<arrow::dataset::GetScanTaskIterator(arrow::Iterator<std::__1::shared_ptr<arrow::dataset::Fragment>
>  >, std::__1::shared_ptr<arrow::dataset::ScanOptions>, 
> std::__1::shared_ptr<arrow::dataset::ScanContext>)::'lambda'(std::__1::shared_ptr<arrow::dataset::Fragment>),
>  std::__1::shared_ptr<arrow::dataset::Fragment>, 
> arrow::Iterator<std::__1::shared_ptr<arrow::dataset::ScanTask> > >::Next() + 
> 48
>     frame #19: 0x0000000132fffd47 
> libarrow_dataset.300.0.0.dylib`arrow::Result<arrow::Iterator<std::__1::shared_ptr<arrow::dataset::ScanTask>
>  > > 
> arrow::Iterator<arrow::Iterator<std::__1::shared_ptr<arrow::dataset::ScanTask>
>  > 
> >::Next<arrow::MapIterator<arrow::dataset::GetScanTaskIterator(arrow::Iterator<std::__1::shared_ptr<arrow::dataset::Fragment>
>  >, std::__1::shared_ptr<arrow::dataset::ScanOptions>, 
> std::__1::shared_ptr<arrow::dataset::ScanContext>)::'lambda'(std::__1::shared_ptr<arrow::dataset::Fragment>),
>  std::__1::shared_ptr<arrow::dataset::Fragment>, 
> arrow::Iterator<std::__1::shared_ptr<arrow::dataset::ScanTask> > > >(void*) + 
> 39
>     frame #20: 0x0000000133003dcb 
> libarrow_dataset.300.0.0.dylib`arrow::Iterator<arrow::Iterator<std::__1::shared_ptr<arrow::dataset::ScanTask>
>  > >::Next() + 43
>     frame #21: 0x0000000133003782 
> libarrow_dataset.300.0.0.dylib`arrow::FlattenIterator<std::__1::shared_ptr<arrow::dataset::ScanTask>
>  >::Next() + 258
>     frame #22: 0x0000000133003567 
> libarrow_dataset.300.0.0.dylib`arrow::Result<std::__1::shared_ptr<arrow::dataset::ScanTask>
>  > arrow::Iterator<std::__1::shared_ptr<arrow::dataset::ScanTask> 
> >::Next<arrow::FlattenIterator<std::__1::shared_ptr<arrow::dataset::ScanTask> 
> > >(void*) + 39
>     frame #23: 0x0000000132fd479b 
> libarrow_dataset.300.0.0.dylib`arrow::Iterator<std::__1::shared_ptr<arrow::dataset::ScanTask>
>  >::Next() + 43
>     frame #24: 0x0000000132fd44e8 
> libarrow_dataset.300.0.0.dylib`arrow::Iterator<std::__1::shared_ptr<arrow::dataset::ScanTask>
>  >::RangeIterator::Next() + 88
>     frame #25: 0x0000000132ffe43d 
> libarrow_dataset.300.0.0.dylib`arrow::dataset::Scanner::ToTable() + 589
>     frame #26: 0x0000000132f2963a 
> _dataset.cpython-39-darwin.so`__pyx_pw_7pyarrow_8_dataset_7Scanner_13to_table(_object*,
>  _object*) + 74
>     frame #27: 0x0000000132ef47d4 
> _dataset.cpython-39-darwin.so`__Pyx_PyObject_CallNoArg(_object*) + 132
>     frame #28: 0x0000000132ef0cc9 
> _dataset.cpython-39-darwin.so`__pyx_pw_7pyarrow_8_dataset_7Dataset_14to_table(_object*,
>  _object*, _object*) + 569
>     frame #29: 0x00000001000d5a04 python`cfunction_call + 52
>     frame #30: 0x0000000100074998 python`_PyObject_MakeTpCall + 136
>     frame #31: 0x00000001001aa8f3 python`call_function + 323
>     frame #32: 0x00000001001a843f python`_PyEval_EvalFrameDefault + 45039
>     frame #33: 0x000000010019bc04 python`_PyEval_EvalCode + 548
>     frame #34: 0x000000010020ec51 python`pyrun_file + 321
>     frame #35: 0x000000010020e49c python`pyrun_simple_file + 412
>     frame #36: 0x000000010020e2ad python`PyRun_SimpleFileExFlags + 109
>     frame #37: 0x0000000100239ed9 python`pymain_run_file + 329
>     frame #38: 0x00000001002395c0 python`pymain_run_python + 992
>     frame #39: 0x0000000100239185 python`Py_RunMain + 37
>     frame #40: 0x000000010023a8f1 python`pymain_main + 49
>     frame #41: 0x0000000100001b48 python`main + 56
>     frame #42: 0x00007fff73ab2cc9 libdyld.dylib`start + 1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
