Re: Parquet sync up

Julien Le Dem Mon, 08 Dec 2014 12:05:56 -0800

Notes:
Attendance
Julien, Alex: Twitter
Daniel, Tongjie: Netflix
Parth, Jason: MapR, Apache Drill
Ryan: Cloudera


Mechanisms to detect and deal with corruption caused by bad hardware:
 - add a mechanism to detect a bad write at write time?
 - 4 jiras to create:
    - print how many records are left in a row group when there is a
problem reading it.
    - option to skip the end of a row group (when we want to read what we
can from a corrupted file)
    - CRC generation and verification to give better error messages and
narrow down the problem
    - option to reread the file on close to verify it is not corrupted
(expensive).
 - new page format will enable more granular recovery from a corrupted file.
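
A minimal sketch of the CRC generation and verification item above, assuming a CRC32 is computed over each page's bytes at write time and stored next to the page. PageChecksum and its parameters are hypothetical names for illustration, not part of the Parquet API, and where the checksum would actually live was not decided in the meeting.

```java
import java.util.zip.CRC32;

// Hypothetical helper for illustration only; not a Parquet class.
final class PageChecksum {
    private PageChecksum() {}

    /** Computes a CRC32 over the raw page bytes (done once at write time). */
    static long compute(byte[] pageBytes) {
        CRC32 crc = new CRC32();
        crc.update(pageBytes, 0, pageBytes.length);
        return crc.getValue();
    }

    /**
     * Recomputes the CRC at read time and fails with a message that names
     * the row group, column and page, so the problem can be narrowed down.
     */
    static void verify(byte[] pageBytes, long expectedCrc,
                       int rowGroup, String column, int pageIndex) {
        long actual = compute(pageBytes);
        if (actual != expectedCrc) {
            throw new IllegalStateException(String.format(
                "CRC mismatch in row group %d, column %s, page %d: "
                    + "expected %08x but computed %08x",
                rowGroup, column, pageIndex, expectedCrc, actual));
        }
    }
}
```

A reader that wants to salvage what it can from a corrupted file could catch the exception and skip the rest of the row group instead of failing the whole read.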

PARQUET-100:
 - no need for metadata on the client side in InputFormat; only Pig needs
it.
 - will work on making a PigLoader that does not require reading the schema
from footers.

ParquetFileInputFormat:
  - should be a feature of the existing input format.

release 1.6.0:
  - add new ParquetFileInputFormat in release.
  - need to do PARQUET-111

vectorized execution:
 - Drill and Netflix have slightly different goals.
 - Drill will submit their vectorized reader as a pull request.
 - Zengxiao from Netflix works on lazy load for vectorized filter
evaluation in Presto. (avoid decoding pages when not necessary)
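
A rough sketch of the lazy-load idea, assuming a page is decoded only the first time its values are requested. LazyPage and LazyFilterScan are hypothetical types for illustration, not Parquet or Presto classes, and the real reader may be structured quite differently.

```java
import java.util.function.Supplier;

// Hypothetical page wrapper: decoding is deferred until values() is called.
final class LazyPage {
    private final Supplier<long[]> decoder;
    private long[] values;
    private boolean decoded;

    LazyPage(Supplier<long[]> decoder) {
        this.decoder = decoder;
    }

    long[] values() {
        if (!decoded) {
            values = decoder.get(); // the expensive decode happens here, once
            decoded = true;
        }
        return values;
    }

    boolean isDecoded() {
        return decoded;
    }
}

// Hypothetical scan: evaluate the filter first, touch other columns on a match.
final class LazyFilterScan {
    private LazyFilterScan() {}

    /** Sums the projected column over rows whose filter value exceeds threshold. */
    static long sumWhereGreaterThan(LazyPage filterColumn, LazyPage projected,
                                    long threshold) {
        long[] keys = filterColumn.values();
        long sum = 0;
        for (int i = 0; i < keys.length; i++) {
            if (keys[i] > threshold) {
                // The first matching row triggers decoding of the projected page.
                sum += projected.values()[i];
            }
        }
        return sum;
    }
}
```

If no row in the page passes the filter, the projected page is never decoded, which is the saving the item above refers to.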

memory manager for dynamic partitions in Hive
 - Hive 0.13 parameter for sorting before dynamic partitioning so that only
one writer is open at a time.
  => HIVE-6455
 - memory manager in parquet almost ready.
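
One way a memory manager for many concurrent writers could work, sketched under the assumption that each writer requests a row group size and all requests are scaled down proportionally when their sum exceeds a shared budget. WriterMemoryBudget is a hypothetical class for illustration and is not the implementation discussed in the meeting.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not the actual Parquet memory manager.
final class WriterMemoryBudget {
    private final long totalBudgetBytes;
    private final Map<String, Long> requested = new LinkedHashMap<>();

    WriterMemoryBudget(long totalBudgetBytes) {
        this.totalBudgetBytes = totalBudgetBytes;
    }

    /** Registers a writer, e.g. when a new dynamic partition is opened. */
    void addWriter(String name, long requestedRowGroupBytes) {
        requested.put(name, requestedRowGroupBytes);
    }

    /** Unregisters a writer when its file is closed. */
    void removeWriter(String name) {
        requested.remove(name);
    }

    /** Bytes this writer may buffer before flushing its row group. */
    long allowedBytes(String name) {
        Long mine = requested.get(name);
        if (mine == null) {
            throw new IllegalArgumentException("unknown writer: " + name);
        }
        long total = 0;
        for (long r : requested.values()) {
            total += r;
        }
        if (total <= totalBudgetBytes) {
            return mine; // everything fits, no scaling needed
        }
        // Over budget: shrink every writer by the same ratio.
        return (long) (mine * ((double) totalBudgetBytes / total));
    }
}
```

With sorting before dynamic partitioning only one writer is open at a time, so the scaling branch would rarely be taken; without it, every open partition shares the budget.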

Blog posts from Cloudera, MapR, Twitter, Netflix.

next meeting January 5th

On Mon, Dec 8, 2014 at 10:27 AM, Julien Le Dem <[email protected]> wrote:

> It is happening at 10:30 am PST (in 5 min) on google hangout.
> Google hangout has a maximum of 10 connections.
> Please share the connection if you can to allow more people to join.
> https://plus.google.com/events/cvkgu217llltujv5siddodk2oa0
>
