[ 
https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780735#action_12780735
 ] 

Yan Zhou commented on PIG-1077:
-------------------------------

This patch is also targeted for the 0.6 release, so it needs to be on the 0.6 
branch too.

> [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
> ---------------------------------------------------------------------------
>
>                 Key: PIG-1077
>                 URL: https://issues.apache.org/jira/browse/PIG-1077
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: 0.4.0
>            Reporter: Chao Wang
>            Assignee: Chao Wang
>             Fix For: 0.7.0
>
>         Attachments: patch_Pig1077
>
>
> TFile currently supports split by record sequence number (see Jira 
> HADOOP-6218). We want to utilize this to provide record(row)-based input 
> split support in Zebra.
> One prominent benefit: for very large data files, we can create much more 
> fine-grained input splits than before, when we could only create one big 
> split per big file.
> In more detail, the new row-based getSplits() works as follows by default 
> (that is, when the user does not specify the number of splits to generate): 
> 1) Select the biggest column group in terms of data size, split all of its 
> TFiles according to hdfs block size (64 MB or 128 MB) and get a list of 
> physical byte offsets as the output per TFile. For example, let us assume for 
> the 1st TFile we get offset1, offset2, ..., offset10; 
> 2) Invoke TFile.getRecordNumNear(long offset) to get the RecordNum of a 
> key-value pair near a byte offset. For the example above, say we get 
> recordNum1, recordNum2, ..., recordNum10; 
> 3) Stitch [0, recordNum1], [recordNum1+1, recordNum2], ..., [recordNum9+1, 
> recordNum10], [recordNum10+1, lastRecordNum] splits of all column groups, 
> respectively to form 11 record-based input splits for the 1st TFile. 
> 4) For each input split, we need to create a TFile scanner through: 
> TFile.createScannerByRecordNum(long beginRecNum, long endRecNum). 
> Note: the conversion from byte offset to record number is done by each 
> mapper rather than during the job initialization phase. This is for 
> performance reasons, since the conversion incurs some TFile reading 
> overhead.
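
To make steps 1) and 3) above concrete, here is a minimal sketch of the arithmetic only. It is not Zebra or TFile code: block_offsets and stitch_ranges are hypothetical helper names, the record numbers are passed in as plain integers instead of being obtained from TFile, and the ranges are inclusive on both ends as in the notation of the description. Skipping repeated boundaries is an assumption of this sketch, not something the description specifies.

```python
def block_offsets(file_length, block_size):
    """Byte offsets at each block boundary strictly inside the file (step 1)."""
    return list(range(block_size, file_length, block_size))


def stitch_ranges(record_nums, last_record_num):
    """Turn boundary record numbers into inclusive (begin, end) ranges (step 3).

    record_nums stands in for the values that the offset-to-record-number
    lookup of step 2 would return; here they are just given as input.
    """
    ranges = []
    begin = 0
    for boundary in record_nums:
        if boundary < begin:
            # Assumption of this sketch: two offsets that map to the same
            # record must not produce an empty or backwards range.
            continue
        ranges.append((begin, boundary))
        begin = boundary + 1
    if begin <= last_record_num:
        # Remaining records after the last boundary form the final split.
        ranges.append((begin, last_record_num))
    return ranges


if __name__ == "__main__":
    # A 300-byte "file" with 128-byte blocks has two interior boundaries.
    print(block_offsets(300, 128))
    # Three boundary records and 350 records in total give four ranges.
    print(stitch_ranges([99, 199, 299], 349))
```

With ten boundary record numbers this yields eleven ranges, which matches the count given in step 3) for the 1st TFile.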

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
