[ 
https://issues.apache.org/jira/browse/PIG-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805195#action_12805195
 ] 

Yan Zhou commented on PIG-1201:
-------------------------------

HDFS listStatus calls issued by every mapper to the name node are costly, 
particularly if the target has a huge number of entries, i.e., files and 
directories. Zebra has this problem in two places:

1) For unsorted tables, the index is not built on disk. The input split, which 
is a tfile row split, carries a file index that each and every mapper must map 
to a file name using the index, which contains the file names in order and 
their sizes. Building the index makes the listStatus call because it needs 
information about all files, and if the number of files is huge, this strains 
name node resources. Instead, the file index in the split can be replaced with 
the file name, so that the mapping, and consequently the index, is not needed 
at all for routine operations such as queries against the tables. For other 
informational requests such as dumpInfo, where a comprehensive picture is 
required, the index can be built as needed. An on-disk index is still 
preferable because it would save one listStatus call by the front end, but 
relying on it would require more changes to support backward compatibility, 
and the meta file that holds the index does not support versioning. 
Consequently, that work is deferred to a future release, although the on-disk 
index will be built for future convenience;

2) Each BasicTable.Reader, at construction, checks and marks all deleted CGs 
in the SchemaFile.setCGDeletedFlags method, which makes the listStatus call. 
This may not be as bad as the case in 1), but for tables with many CGs it 
could present a problem. Instead, the check can be made only by the front end, 
which then passes the information to the mappers.
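
As a rough illustration of both points, here is a minimal sketch in plain Java. 
It does not use Zebra or Hadoop classes: FrontEndSketch, its method names, the 
sorted-name ordering of the index, and the '0'/'1' string encoding are all 
assumptions made for the example, and the listStatusCalls counter only stands in 
for a real name node listing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for illustration; none of these are Zebra classes.
class FrontEndSketch {
    /** Counts simulated name node listings (stand-in for HDFS listStatus). */
    static int listStatusCalls = 0;

    /** Simulated directory listing; the ordering by name is an assumption. */
    static List<String> listFiles(List<String> filesOnDisk) {
        listStatusCalls++;
        List<String> ordered = new ArrayList<>(filesOnDisk);
        Collections.sort(ordered);
        return ordered;
    }

    /** Point 1, before: the split holds a position, so the mapper must list. */
    static String resolveByIndex(int fileIndex, List<String> filesOnDisk) {
        return listFiles(filesOnDisk).get(fileIndex);
    }

    /** Point 1, after: the split holds the name, so no listing is needed. */
    static String resolveByName(String fileNameInSplit) {
        return fileNameInSplit;
    }

    /** Point 2, front end: encode the deleted-CG flags ('1' = deleted). */
    static String encodeDeletedCGs(boolean[] deleted) {
        StringBuilder sb = new StringBuilder(deleted.length);
        for (boolean d : deleted) {
            sb.append(d ? '1' : '0');
        }
        return sb.toString();
    }

    /** Point 2, mapper: decode the flags instead of checking the file system. */
    static boolean[] decodeDeletedCGs(String encoded) {
        boolean[] flags = new boolean[encoded.length()];
        for (int i = 0; i < flags.length; i++) {
            flags[i] = encoded.charAt(i) == '1';
        }
        return flags;
    }
}
```

In this sketch the front end would call encodeDeletedCGs after its own check and 
ship the resulting string with the job, so that each mapper only calls 
decodeDeletedCGs and resolveByName and never triggers a listing.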

The huge JobConf serialization size in the Pig loader implementation will be 
fixed by serializing only the few configuration variables that Zebra needs. 
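
The idea can be sketched as a simple whitelist filter. This is an illustration 
only: a plain Map stands in for the JobConf, and ConfFilterSketch, selectNeeded, 
and the key names used with it are hypothetical, not names from the patch.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper; a Map stands in for the job configuration.
class ConfFilterSketch {
    /** Copies only the entries whose keys are needed, skipping unset keys. */
    static Map<String, String> selectNeeded(Map<String, String> fullConf,
                                            Set<String> neededKeys) {
        Map<String, String> slim = new TreeMap<>();
        for (String key : neededKeys) {
            String value = fullConf.get(key);
            if (value != null) {
                slim.put(key, value);
            }
        }
        return slim;
    }
}
```

Only the map returned by selectNeeded would then be serialized, so the 
serialized size depends on the number of needed variables rather than on the 
size of the whole configuration.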

> [zebra] HDFS meta queries are issued by all mappers; Pig Loader serialize all 
> JobConf contents including those unused by zebra
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1201
>                 URL: https://issues.apache.org/jira/browse/PIG-1201
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>            Priority: Minor
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: PIG-1201.patch
>
>


