[jira] [Commented] (ARROW-8726) [R][Dataset] segfault with a mis-specified partition

Joris Van den Bossche (Jira) Thu, 07 May 2020 08:42:01 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-8726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101796#comment-17101796
 ]


Joris Van den Bossche commented on ARROW-8726:
----------------------------------------------

[~jonkeane] are you using the released 0.17, or the development version?

I tried with a Python reproducer. On master it does not fail with a segfault 
(my example is of course not exactly the same as yours, and there have already 
been quite a few changes in the Dataset class since 0.17), but I still see 
"peculiar" (wrong) behaviour:

{code}
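import pathlib
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Only one directory level below the root: temp_dataset/{one,two}/data.parquet
path = pathlib.Path("temp_dataset")
(path / "one").mkdir(parents=True)  # parents=True also creates temp_dataset
(path / "two").mkdir(parents=True)

table = pa.table({'col': [1, 2, 3, 4]})
pq.write_table(table, str(path / "one" / "data.parquet"))
pq.write_table(table, str(path / "two" / "data.parquet"))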
{code}

gives:

{code}
In [14]: ds.dataset(path, partitioning=["level", "nothing"]).schema
Out[14]: 
col: int64
level: string
nothing: string

In [18]: ds.dataset(path, partitioning=["level", "nothing"]).to_table().to_pandas()
Out[18]: 
   col level       nothing
0    1   one  data.parquet
1    2   one  data.parquet
2    3   one  data.parquet
3    4   one  data.parquet
4    1   two  data.parquet
5    2   two  data.parquet
6    3   two  data.parquet
7    4   two  data.parquet
{code}

So for the second partition field, the file name is used as the value.

And with a third field, I see:

{code}
In [20]: ds.dataset(path, partitioning=["level", "nothing", "else"]).to_table().to_pandas()
...
~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.DatasetFactory.finish()

ArrowInvalid: No segments were available for field 'else'; couldn't infer type
{code}
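
A toy model of what the two outputs above suggest, in case it helps the discussion. This is not the Arrow implementation; `assign_segments` is a hypothetical helper, and it assumes that the file name is counted as a path segment and that segments are matched to partition fields purely by position:

```python
from pathlib import PurePosixPath


def assign_segments(relative_path, field_names):
    """Pair each path segment with a partition field name, by position.

    Assumption (inferred from the output above, not from the Arrow source):
    the file name itself is treated as a segment.
    """
    segments = PurePosixPath(relative_path).parts
    # A field with no segment at its position gets None.
    return {
        name: (segments[i] if i < len(segments) else None)
        for i, name in enumerate(field_names)
    }


two_fields = assign_segments("one/data.parquet", ["level", "nothing"])
three_fields = assign_segments("one/data.parquet", ["level", "nothing", "else"])
print(two_fields)
print(three_fields)
```

Under these assumptions the second field picks up "data.parquet", and the third field has no segment left to infer a type from, which matches the ArrowInvalid message.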

> [R][Dataset] segfault with a mis-specified partition
> ----------------------------------------------------
>
>                 Key: ARROW-8726
>                 URL: https://issues.apache.org/jira/browse/ARROW-8726
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Jonathan Keane
>            Assignee: Francois Saint-Jacques
>            Priority: Major
>             Fix For: 0.17.1
>
>
> Calling filter + collect on a dataset with a mis-specified partitioning 
> causes a segfault. Though this is clearly input error, it would be nice if 
> there was some guidance that something was wrong with the partitioning.
> {code:r}
> library(arrow)
> library(dplyr)
> dir.create("multi_mtcars/one", recursive = TRUE)
> dir.create("multi_mtcars/two", recursive = TRUE)
> write_parquet(mtcars, "multi_mtcars/one/mtcars.parquet")
> write_parquet(mtcars, "multi_mtcars/two/mtcars.parquet")
> ds <- open_dataset("multi_mtcars", partitioning = c("level", "nothing"))
> # the following will segfault
> ds %>%
>   filter(cyl > 8) %>% 
>   collect()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
