[ https://issues.apache.org/jira/browse/PIG-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805195#action_12805195 ]
Yan Zhou commented on PIG-1201:
-------------------------------

HDFS listStatus calls issued to the name node by every mapper are costly, particularly if the target has a huge number of entries, i.e., files and directories. Zebra has this problem in two ways:

1) For unsorted tables, the index is not built on disk. The input split, which is a TFile row split, carries a file index that each and every mapper must map to a file name using the index, which contains the file names in order along with their sizes. Building the index issues the listStatus call, because it needs information about all files. If the number of files is huge, this strains name node resources. Instead, the file index in the split can be replaced with the file name, so that the mapping, and consequently the index, is not needed at all for routine operations such as queries against the tables. For other informational requests such as dumpInfo, where a comprehensive picture is required, the index can be built as needed. The on-disk index is still preferred, as it would save one listStatus call by the front end. But it would require more changes to support backward compatibility, and the meta file that holds the index does not support versioning. Consequently, this work is deferred to a future release, although the on-disk index will be built for future convenience.

2) Each BasicTable.Reader, at construction, checks and marks all deleted CGs in the SchemaFile.setCGDeletedFlags method, which makes the listStatus call. This may not be as bad as case 1), but for tables with many CGs it could present a problem. Instead, the check can be made only by the front end, with the result passed to the mappers.

The huge JobConf serialization size in the Pig loader implementation will be fixed by serializing only the few configuration variables that Zebra needs.
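To illustrate the last point, here is a minimal sketch of serializing only a whitelisted subset of a job configuration. It is not taken from the attached patch: the class ConfSubset and its select method are hypothetical, and a plain Map stands in for JobConf so that the example has no Hadoop dependency.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Zebra or Pig: copies only the keys a
// loader actually needs out of a full job configuration, so that the
// serialized form stays small.
final class ConfSubset {
    private ConfSubset() {}

    // Returns a new map holding only the entries of fullConf whose keys
    // appear in neededKeys. Keys absent from fullConf are skipped.
    static Map<String, String> select(Map<String, String> fullConf,
                                      List<String> neededKeys) {
        Map<String, String> subset = new LinkedHashMap<>();
        for (String key : neededKeys) {
            String value = fullConf.get(key);
            if (value != null) {
                subset.put(key, value);
            }
        }
        return subset;
    }
}
```

The front end would call select once with the short list of keys it knows the loader reads, and serialize the returned map instead of the whole configuration. Which keys belong on that list is specific to Zebra and is not shown here.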
> [zebra] HDFS meta queries are issued by all mappers; Pig Loader serialize all
> JobConf contents including those unused by zebra
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1201
>                 URL: https://issues.apache.org/jira/browse/PIG-1201
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>            Priority: Minor
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: PIG-1201.patch
>