[
https://issues.apache.org/jira/browse/ARROW-8726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101796#comment-17101796
]
Joris Van den Bossche commented on ARROW-8726:
----------------------------------------------
[~jonkeane] are you using the released 0.17, or the development version?
I tried a Python reproducer. On master it does not fail with a segfault (I am
not doing exactly the same thing, and the Dataset class has already changed
quite a bit since 0.17), but I still see "peculiar" (wrong) behaviour:
{code}
import pathlib

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# One directory level below the root: temp_dataset/{one,two}/data.parquet
path = pathlib.Path("temp_dataset")
(path / "one").mkdir(parents=True)  # also creates temp_dataset itself
(path / "two").mkdir()

table = pa.table({'col': [1, 2, 3, 4]})
pq.write_table(table, str(path / "one" / "data.parquet"))
pq.write_table(table, str(path / "two" / "data.parquet"))
{code}
gives:
{code}
In [14]: ds.dataset(path, partitioning=["level", "nothing"]).schema
Out[14]:
col: int64
level: string
nothing: string
In [18]: ds.dataset(path, partitioning=["level", "nothing"]).to_table().to_pandas()
Out[18]:
   col  level       nothing
0    1    one  data.parquet
1    2    one  data.parquet
2    3    one  data.parquet
3    4    one  data.parquet
4    1    two  data.parquet
5    2    two  data.parquet
6    3    two  data.parquet
7    4    two  data.parquet
{code}
so the file name ends up being used as the value of the second partition field ...
And with a third field, I see:
{code}
In [20]: ds.dataset(path, partitioning=["level", "nothing", "else"]).to_table().to_pandas()
...
~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.DatasetFactory.finish()
ArrowInvalid: No segments were available for field 'else'; couldn't infer type
{code}
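To make the two observations above easier to reason about, here is a toy model in plain Python of what the output suggests is happening: each path segment below the dataset root, file name included, seems to be paired positionally with a partition field name. This is only a sketch of the observed behaviour and not the actual Arrow implementation; the helper names assign_segments and directory_depth are made up for illustration.

```python
from pathlib import PurePosixPath


def assign_segments(relative_path, field_names):
    """Toy model: pair each path segment (file name included) with a field name.

    Fields that have no corresponding segment get None.
    """
    segments = PurePosixPath(relative_path).parts
    return {
        name: segments[i] if i < len(segments) else None
        for i, name in enumerate(field_names)
    }


def directory_depth(relative_path):
    """Number of directory levels above the file itself."""
    return len(PurePosixPath(relative_path).parts) - 1


# Two fields, one directory level: the file name lands in the second field.
print(assign_segments("one/data.parquet", ["level", "nothing"]))

# Three fields: nothing is left for the third one.
print(assign_segments("one/data.parquet", ["level", "nothing", "else"]))
```

Under this model, comparing directory_depth("one/data.parquet"), which is 1, with the number of requested partition fields would flag both calls above as mis-specified before any data is read.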
> [R][Dataset] segfault with a mis-specified partition
> ----------------------------------------------------
>
> Key: ARROW-8726
> URL: https://issues.apache.org/jira/browse/ARROW-8726
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Jonathan Keane
> Assignee: Francois Saint-Jacques
> Priority: Major
> Fix For: 0.17.1
>
>
> Calling filter + collect on a dataset with a mis-specified partitioning
> causes a segfault. Though this is clearly input error, it would be nice if
> there was some guidance that something was wrong with the partitioning.
> {code:r}
> library(arrow)
> library(dplyr)
> dir.create("multi_mtcars/one", recursive = TRUE)
> dir.create("multi_mtcars/two", recursive = TRUE)
> write_parquet(mtcars, "multi_mtcars/one/mtcars.parquet")
> write_parquet(mtcars, "multi_mtcars/two/mtcars.parquet")
> ds <- open_dataset("multi_mtcars", partitioning = c("level", "nothing"))
> # the following will segfault
> ds %>%
>   filter(cyl > 8) %>%
>   collect()
> {code}