Notes:
Attendance:
Julien, Alex: Twitter
Daniel, Tongjie: Netflix
Parth, Jason: MapR, Apache Drill
Ryan: Cloudera
Mechanisms to detect and deal with corruption caused by bad hardware:
- add a mechanism to detect bad writes at write time?
- 4 JIRAs to create:
  - print how many records are left in a row group when there is a problem
    reading it.
  - option to skip the end of a row group (when we want to read what we can
    from a corrupted file).
  - CRC generation and verification, to get better error messages and narrow
    down the problem.
  - option to reread the file on close to verify it is not corrupted
    (expensive).
- the new page format will enable more granular recovery from a corrupted file.
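
To make the CRC and "read what we can" items concrete, here is a minimal
sketch in Python. It is not the Parquet implementation: the framing (a
CRC32 stored next to each page payload) and the names write_pages,
read_pages and skip_corrupt are assumptions made up for illustration.

```python
import zlib


def write_pages(pages):
    """Frame each page payload as (crc32, payload). Hypothetical layout."""
    return [(zlib.crc32(p) & 0xFFFFFFFF, p) for p in pages]


def read_pages(framed, skip_corrupt=False):
    """Verify each page's CRC and return the payloads that pass.

    With skip_corrupt=False, the first mismatch raises an error that names
    the page, which narrows down where the corruption is.
    With skip_corrupt=True, bad pages are dropped and reading continues.
    """
    good = []
    for index, (expected, payload) in enumerate(framed):
        actual = zlib.crc32(payload) & 0xFFFFFFFF
        if actual != expected:
            if skip_corrupt:
                continue
            remaining = len(framed) - index - 1
            raise ValueError(
                f"page {index}: CRC mismatch "
                f"(expected {expected:#010x}, got {actual:#010x}); "
                f"{remaining} page(s) after it not read"
            )
        good.append(payload)
    return good
```

The same check run over the whole file right after closing it would be one
way to do the (expensive) reread-on-close verification.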
PARQUET-100:
- no need for footer metadata on the client side of the InputFormat; it is
only needed for Pig.
- will work on making a PigLoader that does not require reading the schema
from footers.
ParquetFileInputFormat:
- should be a feature of the existing input format.
release 1.6.0:
- add new ParquetFileInputFormat in release.
- need to do PARQUET-111
vectorized execution:
- Drill and Netflix have slightly different goals.
- Drill will submit their vectorized reader as a pull request.
- Zengxiao from Netflix is working on lazy loading for vectorized filter
evaluation in Presto (to avoid decoding pages when not necessary).
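
The lazy-load idea can be sketched as follows. This is only an
illustration in Python, not the Presto or Netflix code: LazyColumn,
filter_batch and the comma-separated "encoding" are hypothetical
stand-ins. The filter column is decoded first, and the other column is
decoded only if at least one row passes the filter.

```python
class LazyColumn:
    """Hypothetical column whose values are decoded on first access."""

    def __init__(self, encoded):
        self._encoded = encoded  # stand-in for encoded page bytes
        self._values = None
        self.decoded = False

    def values(self):
        if self._values is None:
            # Stand-in for the real (costly) page decoding.
            self._values = [int(x) for x in self._encoded.split(",")]
            self.decoded = True
        return self._values


def filter_batch(filter_col, payload_col, predicate):
    """Return payload values for the rows whose filter value matches."""
    selected = [i for i, v in enumerate(filter_col.values()) if predicate(v)]
    if not selected:
        return []  # nothing matched: payload_col is never decoded
    payload = payload_col.values()
    return [payload[i] for i in selected]
```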
memory manager for dynamic partitions in Hive:
- Hive 0.13 has a parameter for sorting before dynamic partitioning, so that
only one writer is open at a time.
=> HIVE-6455
- the memory manager in Parquet is almost ready.
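
Why sorting helps can be shown with a small simulation. This is a sketch
in Python, not Hive or Parquet code: write_partitioned and its in-memory
"writers" are hypothetical. With unsorted input, every partition seen so
far keeps a writer (and its buffers) open; with input sorted on the
partition key, a writer can be closed as soon as the key changes.

```python
from operator import itemgetter


def write_partitioned(rows, key, sort_first):
    """Route rows to per-partition writers.

    Returns (files, peak), where peak is the largest number of writers
    that were open at the same time.
    """
    if sort_first:
        rows = sorted(rows, key=itemgetter(key))  # stable sort
    open_writers = {}  # partition value -> buffered rows
    files = {}         # closed writers
    peak = 0
    for row in rows:
        part = row[key]
        if sort_first and part not in open_writers:
            # Sorted input: the previous partition is complete, close it.
            files.update(open_writers)
            open_writers.clear()
        open_writers.setdefault(part, []).append(row)
        peak = max(peak, len(open_writers))
    files.update(open_writers)  # close whatever is still open
    return files, peak
```

For rows spread over three partitions, the unsorted run peaks at three
open writers and the sorted run at one, with the same output per
partition.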
Blog posts from Cloudera, MapR, Twitter, Netflix.
Next meeting: January 5th.
On Mon, Dec 8, 2014 at 10:27 AM, Julien Le Dem <[email protected]> wrote:
> It is happening at 10:30 am PST (in 5 min) on google hangout.
> Google hangout has a maximum of 10 connections.
> Please share the connection if you can to allow more people to join.
> https://plus.google.com/events/cvkgu217llltujv5siddodk2oa0
>