Hi Joris,

The Apache Parquet mailing list is d...@parquet.apache.org. I'm copying
the list here.

AFAIK parquet-mr doesn't feature vectorized reading (for Arrow or
otherwise). There are some vectorized Java-based readers in the wild: in
Dremio [1] and Apache Spark, at least. I'd be interested to see a
reusable library that supports vectorized Arrow reads in Java.

- Wes

[1]: https://github.com/dremio/dremio-oss

On Thu, May 23, 2019 at 8:54 AM Joris Peeters <joris.mg.peet...@gmail.com> wrote:
>
> Hello,
>
> I'm trying to read a Parquet file from disk into Arrow in memory, in Scala.
> I'm wondering what the most efficient approach is, especially for the
> reading part. I'm aware that Parquet reading is perhaps beyond the scope
> of this mailing list, but:
>
> - I believe Arrow and Parquet are closely intertwined these days?
> - I can't find an appropriate Parquet mailing list.
>
> Any pointers would be appreciated!
>
> Below is the code I currently have. My concern is that this alone already
> takes about 2s, whereas "pq.read_pandas(the_file_path).to_pandas()" takes
> ~100ms in Python. So I suspect I'm not doing this in the most efficient
> way possible... The Parquet data holds 1,570,150 rows, with 14 columns of
> various types, and takes 15MB on disk.
>
> import org.apache.hadoop.conf.Configuration
> import org.apache.parquet.column.ColumnDescriptor
> import org.apache.parquet.example.data.simple.convert.GroupRecordConverter
> import org.apache.parquet.format.converter.ParquetMetadataConverter
> import org.apache.parquet.hadoop.ParquetFileReader
> import org.apache.parquet.io.ColumnIOFactory
>
> ...
>
> val path: Path = Paths.get("C:\\item.pq")
> val jpath = new org.apache.hadoop.fs.Path(path.toFile.getAbsolutePath)
> val conf = new Configuration()
>
> val readFooter = ParquetFileReader.readFooter(conf, jpath,
>   ParquetMetadataConverter.NO_FILTER)
> val schema = readFooter.getFileMetaData.getSchema
> val r = ParquetFileReader.open(conf, jpath)
>
> val pages = r.readNextRowGroup()
> val rows = pages.getRowCount
>
> val columnIO = new ColumnIOFactory().getColumnIO(schema)
> val recordReader = columnIO.getRecordReader(pages,
>   new GroupRecordConverter(schema))
>
> // This takes about 2s
> (1 to rows.toInt).map { i =>
>   val group = recordReader.read
>   // Just read the first column for now ...
>   val x = group.getLong(0, 0)
> }
>
> ...
>
> As this will be in the hot path of my code, I'm quite keen to make it
> as fast as possible. Note that the eventual objective is to build
> Arrow data. I was assuming there would be a way to quickly load the
> columns. I suspect the loop over the rows, building row-based records,
> is causing a lot of overhead, but can't seem to find another way.
>
> Thanks,
>
> -J
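
For reference, one way to cut that per-row overhead is to skip record
assembly entirely and walk each column chunk with parquet-mr's
ColumnReader, copying values straight into an Arrow vector. The sketch
below is untested and makes a few assumptions: parquet-mr 1.10's
four-argument ColumnReadStoreImpl constructor, the arrow-vector library
on the classpath, and a first column that is a flat (non-repeated)
INT64; the vector name "col0" is arbitrary.

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.BigIntVector
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.column.impl.ColumnReadStoreImpl
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader

val jpath = new org.apache.hadoop.fs.Path("C:\\item.pq")
val conf = new Configuration()

val footer = ParquetFileReader.readFooter(conf, jpath,
  ParquetMetadataConverter.NO_FILTER)
val schema = footer.getFileMetaData.getSchema
val createdBy = footer.getFileMetaData.getCreatedBy
val reader = ParquetFileReader.open(conf, jpath)

val allocator = new RootAllocator(Long.MaxValue)
val vector = new BigIntVector("col0", allocator) // assumes column 0 is INT64
vector.allocateNew()

var row = 0
var pages = reader.readNextRowGroup()
while (pages != null) {
  // Read the column chunk directly instead of assembling Group records.
  val store = new ColumnReadStoreImpl(
    pages, new GroupRecordConverter(schema).getRootConverter, schema, createdBy)
  val desc = schema.getColumns.get(0)
  val col = store.getColumnReader(desc)
  val maxDef = desc.getMaxDefinitionLevel
  val total = col.getTotalValueCount
  var i = 0L
  while (i < total) {
    // For a flat column, a definition level below the max marks a null.
    if (col.getCurrentDefinitionLevel == maxDef) vector.setSafe(row, col.getLong)
    else vector.setNull(row)
    col.consume()
    row += 1
    i += 1
  }
  pages = reader.readNextRowGroup()
}
vector.setValueCount(row)
reader.close()

The same loop generalizes to other column types (Float8Vector for
DOUBLE, VarCharVector for BYTE_ARRAY strings, and so on), and each
column can be filled independently, which is where the columnar path
wins over row-by-row record assembly.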