[
https://issues.apache.org/jira/browse/PIG-102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576227#action_12576227
]
Benjamin Reed commented on PIG-102:
-----------------------------------
Yes it can. We open the file in PigInputFormat, so we can get the file from
wherever we want.
I really like this proposal. My only comment would be that it might be nicer to
use shared:/path for files you don't want loaded into Hadoop, rather than using
a configuration file to mark the shared directories. The motivation has to do
with what Craig pointed out: if you have 1000 machines accessing the shared
file, you might still want to copy it to HDFS. That decision depends on the
number of machines, not on the directory. For example, you may have a job doing
a join against a dataset in the NFS directory /nfs/Top10MillionPhrases. If your
first job uses only 20 machines to join against a rather small dataset, you
would probably use shared:/nfs/Top10MillionPhrases. On the other hand, if you
were joining with a 10T dataset, you would probably use
file:/nfs/Top10MillionPhrases so that it gets copied to HDFS.
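The per-path convention described above could be sketched as a simple check on the URI scheme. This is a hypothetical illustration, not actual Pig code: the names SharedPathCheck, readInPlace, and SHARED_SCHEME are made up, and a real patch would live in Pig's input-handling path.

```java
import java.net.URI;

public class SharedPathCheck {
    // Hypothetical scheme marking files that every node can read directly.
    static final String SHARED_SCHEME = "shared";

    // True if the path should be read in place from the shared filesystem
    // rather than copied into HDFS first.
    static boolean readInPlace(String path) {
        String scheme = URI.create(path).getScheme();
        return SHARED_SCHEME.equals(scheme);
    }

    public static void main(String[] args) {
        // Small job, small dataset: read straight from NFS.
        System.out.println(readInPlace("shared:/nfs/Top10MillionPhrases"));
        // Big join: use file:, so the data gets copied into HDFS.
        System.out.println(readInPlace("file:/nfs/Top10MillionPhrases"));
    }
}
```

The point of putting the choice in the path rather than in configuration is that the same NFS directory can be treated differently by different jobs, depending on how many machines will hit it.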
> Dont copy to DFS if source filesystem marked as shared
> ------------------------------------------------------
>
> Key: PIG-102
> URL: https://issues.apache.org/jira/browse/PIG-102
> Project: Pig
> Issue Type: New Feature
> Components: impl
> Environment: Installations with shared folders on all nodes (e.g. NFS)
> Reporter: Craig Macdonald
>
> I've been playing with Pig using three setups:
> (a) local
> (b) hadoop mapred with hdfs
> (c) hadoop mapred with file:///path/to/shared/fs as the default file system
> In our local setup, various NFS filesystems are shared between all machines
> (including mapred nodes), e.g. /users, /local.
> I would like Pig to note when input files are in a file:// directory that has
> been marked as shared, and hence not copy them to DFS.
> Similarly, the Torque PBS resource manager has a usecp directive, which notes
> when a filesystem location is shared between all nodes (and hence scp is not
> needed; cp alone can be used). See
> http://www.clusterresources.com/wiki/doku.php?id=torque:6.2_nfs_and_other_networked_filesystems
> It would be good to have a configurable setting in Pig that says when a
> filesystem is shared, and hence no copying between file:// and hdfs:// is
> needed.
> An example in our setup, if such configuration commands were used, might be:
> sharedFS file:///local/
> sharedFS file:///users/
> This command should be used with care. Obviously if you have 1000 nodes all
> accessing a shared file in NFS, then it would have been better to "hadoopify"
> the file.
> The likely area of code to patch is
> src/org/apache/pig/impl/io/FileLocalizer.java hadoopify(String, PigContext)
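The sharedFS configuration idea above could be sketched as a prefix check that a hadoopify-style helper consults before copying. All names here (SharedFsConfig, isShared, the sample paths) are illustrative assumptions, not the actual FileLocalizer API.

```java
import java.util.Arrays;
import java.util.List;

public class SharedFsConfig {
    // In a real implementation this list would be loaded from Pig's
    // configuration, e.g. lines like "sharedFS file:///local/".
    private final List<String> sharedPrefixes;

    SharedFsConfig(List<String> sharedPrefixes) {
        this.sharedPrefixes = sharedPrefixes;
    }

    // True if fileName lies under a filesystem marked as shared,
    // in which case hadoopify() could skip the copy to DFS.
    boolean isShared(String fileName) {
        return sharedPrefixes.stream().anyMatch(fileName::startsWith);
    }

    public static void main(String[] args) {
        SharedFsConfig cfg = new SharedFsConfig(
                Arrays.asList("file:///local/", "file:///users/"));
        System.out.println(cfg.isShared("file:///users/craig/input.txt"));
        System.out.println(cfg.isShared("file:///tmp/scratch.txt"));
    }
}
```

As the description warns, such a check trades copy cost for NFS load, so it only makes sense when the node count stays modest.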
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.