[ 
https://issues.apache.org/jira/browse/ARROW-10958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17252000#comment-17252000
 ] 

Kouhei Sutou commented on ARROW-10958:
--------------------------------------

We have two interfaces for reading Parquet data in the Arrow C++ layer. Both 
pyarrow and Arrow GLib use Arrow C++.

1. Using Parquet C++ (like you did)
2. Using Arrow Dataset (like pyarrow does)

Interface 1 doesn't yet support reading Parquet data that has a chunked 
List/Struct column (a column split across row groups, in Parquet terms) as a 
single table. But it can read each row group as a separate table, and we can 
combine multiple tables into one table with {{garrow_table_concatenate()}}.

The following code (sorry for using C) will work with interface 1:

{noformat}
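/* Read each row group as its own table, then concatenate them. */
gint n_row_groups = gparquet_arrow_file_reader_get_n_row_groups(reader);
gint i;
GList *tables = NULL;
for (i = 0; i < n_row_groups; i++) {
  /* No column indices are given (NULL, 0); error reporting is skipped
     (last NULL) for brevity. */
  GArrowTable *subtable =
    gparquet_arrow_file_reader_read_row_group(reader, i, NULL, 0, NULL);
  tables = g_list_append(tables, subtable);
}
/* Assumes the file has at least one row group (tables != NULL). */
GArrowTable *table =
  garrow_table_concatenate(tables->data, tables->next, NULL);
/* g_list_free() takes only the list; use g_list_free_full() to also
   unref each subtable. */
g_list_free_full(tables, g_object_unref);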
{noformat}

Interface 2 can read Parquet data as a table even when the Parquet data has a 
chunked column of List/Struct. Unfortunately, Arrow GLib doesn't support this 
feature yet. (I'll work on it.)

> [GLib] "Nested data conversions not implemented" through glib, but not 
> through pyarrow
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-10958
>                 URL: https://issues.apache.org/jira/browse/ARROW-10958
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: GLib
>    Affects Versions: 2.0.0
>         Environment: macOS Catalina 10.15.7
>            Reporter: Samay Kapadia
>            Priority: Major
>
> Hey all,
> For some context, I am trying to use Arrow's GLib interface through Julia; I 
> have a sense that I can speed up my pandas workflows by using Julia and 
> Apache Arrow.
> I have a 1.7GB parquet file that can be read in about 20s by using pyarrow's 
> parquet reader
> {code:java}
> pq.read_table(path)
> {code}
> I've tried to do the same thing through the GLib interface in Julia, but I 
> see this error instead :(
> {code:python}
> [parquet][arrow][file-reader][read-table]: NotImplemented: Nested data 
> conversions not implemented for chunked array outputs
> {code}
> Arrow was installed using {{brew install apache-arrow-glib}} and it installed 
> version 2.0.0
> Here's my Julia code:
> {code:python}
> using Pkg
> Pkg.add("Gtk")
> using Gtk.GLib
> using Gtk
> path = "..." # contains columns that are lists of strings
> struct _GParquetArrowFileReader
>     parent_instance::Cint
> end
> const GParquetArrowFileReader = _GParquetArrowFileReader
> struct _GParquetArrowFileReaderClass
>     parent_class::Cint
> end
> const GParquetArrowFileReaderClass = _GParquetArrowFileReaderClass
> struct _GArrowTable
>     parent_instance::Cint
> end
> const GArrowTable = _GArrowTable
> struct _GArrowTableClass
>     parent_class::Cint
> end
> const GArrowTableClass = _GArrowTableClass
> function parquet_arrow_file_reader_new_path(path::String)::Ptr{GParquetArrowFileReader}
>     ret::Ptr{GParquetArrowFileReader} = 0
>     GError() do error_check
>         ret = ccall(
>             (:gparquet_arrow_file_reader_new_path, "/usr/local/Cellar/apache-arrow-glib/2.0.0/lib/libparquet-glib.200"), 
>             Ptr{GParquetArrowFileReader}, 
>             (Ptr{UInt8}, Ptr{Ptr{GError}}), 
>             Gtk.bytestring(path), error_check
>         )
>         ret != 0
>     end
>     ret
> end
> function parquet_arrow_file_reader_read_table(reader::Ptr{GParquetArrowFileReader})::Ptr{GArrowTable}
>     ret::Ptr{GArrowTable} = 0
>     GError() do error_check
>         ret = ccall(
>             (:gparquet_arrow_file_reader_read_table, "/usr/local/Cellar/apache-arrow-glib/2.0.0/lib/libparquet-glib.200"), 
>             Ptr{GArrowTable}, 
>             (Ptr{GParquetArrowFileReader}, Ptr{Ptr{GError}}), 
>             reader, error_check
>         )
>         ret != 0
>     end
>     ret
> end
> reader = parquet_arrow_file_reader_new_path(path)
> tbl = parquet_arrow_file_reader_read_table(reader)
> {code}
> Am I doing something wrong or is there a behavior discrepancy between pyarrow 
> and glib?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
