Re: best way to copy all files from a file system to hdfs
Could you tar.bz2 them up (setting up the tar so that it makes a few dozen files), toss them onto the HDFS, and use http://stuartsierra.com/2008/04/24/a-million-little-files to convert into SequenceFile? This lets you preserve the originals and do the SequenceFile conversion across the cluster. It's only really helpful, of course, if you also want to prepare a .tar.bz2 so you can clear out the sprawl.

flip

On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner wrote:
> Hi,
>
> I am writing an application to copy all files from a regular PC to a
> SequenceFile. I can surely do this by simply recursing all directories on my
> PC, but I wonder if there is any way to parallelize this, a MapReduce task
> even. Tom White's book seems to imply that it will have to be a custom
> application.
>
> Thank you,
> Mark

--
http://www.infochimps.org
Connected Open Free Data
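The bundling step might look something like this Python sketch (the source directory, output directory, and batch size are invented for illustration; the actual tar-to-SequenceFile conversion is Stuart's tool, run on the cluster):

```python
import os
import tarfile

def bundle(src_dir, out_dir, files_per_archive=10000):
    """Pack a sprawl of small files into a handful of .tar.bz2 bundles
    ready to toss onto HDFS. The batch size is illustrative -- tune it
    so you end up with a few dozen archives, not a million little files."""
    os.makedirs(out_dir, exist_ok=True)
    paths = sorted(
        os.path.join(root, name)
        for root, _dirs, names in os.walk(src_dir)
        for name in names
    )
    archives = []
    for i in range(0, len(paths), files_per_archive):
        out = os.path.join(out_dir, "batch-%04d.tar.bz2" % (i // files_per_archive))
        with tarfile.open(out, "w:bz2") as tar:
            for p in paths[i:i + files_per_archive]:
                tar.add(p)
        archives.append(out)
    return archives

# afterwards, something like:
#   hadoop dfs -put bundles /user/flip/bundles
# and then run the tar-to-SequenceFile conversion across the cluster
```

This keeps the originals recoverable from the archives while shrinking the file count by orders of magnitude.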
Re: HDFS - millions of files in one directory?
Tossing one more on this king of all threads: Stuart Sierra of AltLaw wrote a nice little tool to serialize tar.bz2 files into a SequenceFile, with the filename as key and its contents as a BLOCK-compressed blob.
http://stuartsierra.com/2008/04/24/a-million-little-files

flip

On Mon, Jan 26, 2009 at 3:20 PM, Mark Kerzner wrote:
> Jason, this is awesome, thank you.
> By the way, is there a book or manual with "best practices?"
>
> On Mon, Jan 26, 2009 at 3:13 PM, jason hadoop wrote:
> >
> > Sequence files rock, and you can use the
> > *bin/hadoop dfs -text FILENAME* command line tool to get a toString-level
> > unpacking of the sequence file key,value pairs.
> >
> > If you provide your own key or value classes, you will need to implement a
> > toString method to get some use out of this. Also, your class path will need
> > to include the jars with your custom key/value classes.
> >
> > HADOOP_CLASSPATH="myjar1;myjar2..." *bin/hadoop dfs -text FILENAME*
> >
> > On Mon, Jan 26, 2009 at 1:08 PM, Mark Kerzner wrote:
> > >
> > > Thank you, Doug, then all is clear in my head.
> > > Mark
> > >
> > > On Mon, Jan 26, 2009 at 3:05 PM, Doug Cutting wrote:
> > > >
> > > > Mark Kerzner wrote:
> > > > > Okay, I am convinced. I only noticed that Doug, the originator, was not
> > > > > happy about it - but in open source one has to give up control sometimes.
> > > >
> > > > I think perhaps you misunderstood my remarks. My point was that, if you
> > > > looked to Nutch's Content class for an example, it is, for historical
> > > > reasons, somewhat more complicated than it needs to be and is thus a less
> > > > than perfect example. But using SequenceFile to store web content is
> > > > certainly a best practice and I did not mean to imply otherwise.
> > > >
> > > > Doug

--
http://www.infochimps.org
Connected Open Free Data
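Conceptually, Stuart's tool turns each archive member into one key/value record. A toy Python stand-in for that read side (a sketch only -- a real SequenceFile has its own binary format, sync markers, and BLOCK compression handled by Hadoop):

```python
import tarfile

def tar_to_records(tar_path):
    """Yield (filename, file contents) pairs from a tar.bz2 archive --
    the key/value pairs that get written into the SequenceFile.
    Directories and other non-file members are skipped."""
    with tarfile.open(tar_path, "r:bz2") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()
```

With filename as key, a MapReduce job over the SequenceFile sees each little file as one input record instead of one HDFS file.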
Re: HDFS - millions of files in one directory?
I think that Google developed BigTable <http://en.wikipedia.org/wiki/BigTable> to solve this; hadoop's HBase, or any of the myriad other distributed/document databases should work depending on need:
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
http://www.mail-archive.com/core-user@hadoop.apache.org/msg07011.html

Heritrix <http://en.wikipedia.org/wiki/Heritrix>, Nutch <http://en.wikipedia.org/wiki/Nutch>, and others use the ARC file format:
http://www.archive.org/web/researcher/ArcFileFormat.php
http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml

These of course are industrial-strength tools (and many of their authors are here in the room with us :) The only question with those tools is whether their might exceeds your needs.

There's some oddball project out there that does peer-to-peer something-something scraping, but I can't find it anywhere in my bookmarks. I don't recall whether they're file-backed or DB-backed.

If you, like us, want something more modest and targeted, there is the recently-released python toolkit
http://lucasmanual.com/mywiki/DataHub
I haven't looked at it to see if they've used it at scale.

We infochimps are working right now to clean up and organize for initial release our own Infinite Monkeywrench, a homely but effective toolkit for gathering and munging datasets. (Those stupid little one-off scripts you write and throw away? A Ruby toolkit to make them slightly less annoying.) We frequently use it for directed scraping of APIs and websites. If you're willing to deal with pre-release code that's never strayed far from the machines of the guys what wrote it, I can point you to what we have.

I think I was probably too tough on bundling into files. If things are immutable, and only treated in bulk, and are easily and reversibly serialized, bundling many documents into a file is probably good. As I said, our toolkit uses flat text files, with the advantages of simplicity and the downside of ad hoc-ness.
Storing into the ARC format lets you use the tools in the Big Scraper ecosystem, but obvs. you'd need to convert out to use with other things, possibly returning you to this same question.

If you need to grab arbitrary subsets of the data, and the one set of locality tradeoffs is better than the other set of locality tradeoffs, or you need better metadata management than bundled-into-file gives you, then I think that's why those distributed/document-type databases got invented.

flip

On Sat, Jan 24, 2009 at 7:21 PM, Mark Kerzner wrote:
> Philip,
>
> it seems like you went through the same problems as I did, and confirmed my
> feeling that this is not a trivial problem. My first idea was to balance the
> directory tree somehow and to store the remaining metadata elsewhere, but as
> you say, it has limitations. I could use some solution like your specific
> one, but I am only surprised that this problem does not have a well-known
> solution, or solutions. Again, how does Google or Yahoo store the files that
> they have crawled? MapReduce paper says that they store them all first, that
> is a few billion pages. How do they do it?
>
> Raghu,
>
> if I write all files only once, is the cost the same in one directory or do I
> need to find the optimal directory size and when full start another
> "bucket?"
>
> Thank you,
> Mark
>
> On Fri, Jan 23, 2009 at 11:01 PM, Philip (flip) Kromer wrote:
> >
> > I ran into this problem, hard, and I can vouch that this is not a
> > windows-only problem. ReiserFS, ext3 and OSX's HFS+ become cripplingly
> > slow with more than a few hundred thousand files in the same directory.
> > (The operation to correct this mistake took a week to run.) That is one
> > of several hard lessons I learned about "don't write your scraper to
> > replicate the path structure of each document as a file on disk."
> >
> > Cascading the directory structure works, but sucks in various other
> > ways, and itself stops scaling after a while. What I eventually realized
> > is that I was using the filesystem as a particularly wrongheaded
> > document database, and that the metadata delivery of a filesystem just
> > doesn't work for this.
> >
> > Since in our application the files are text and are immutable, our ad
> > hoc solution is to encode and serialize each file with all its metadata,
> > one per line, into a flat file.
> >
> > A distributed database is probably the correct answer, but this is
> > working quite well for now and even has some advantages. (No-cost
> > replication from work to home or offline by rsync or thumb drive, for
> > example.)
> >
> > flip
Re: HDFS - millions of files in one directory?
I ran into this problem, hard, and I can vouch that this is not a windows-only problem. ReiserFS, ext3 and OSX's HFS+ become cripplingly slow with more than a few hundred thousand files in the same directory. (The operation to correct this mistake took a week to run.) That is one of several hard lessons I learned about "don't write your scraper to replicate the path structure of each document as a file on disk."

Cascading the directory structure works, but sucks in various other ways, and itself stops scaling after a while. What I eventually realized is that I was using the filesystem as a particularly wrongheaded document database, and that the metadata delivery of a filesystem just doesn't work for this.

Since in our application the files are text and are immutable, our ad hoc solution is to encode and serialize each file with all its metadata, one per line, into a flat file.

A distributed database is probably the correct answer, but this is working quite well for now and even has some advantages. (No-cost replication from work to home or offline by rsync or thumb drive, for example.)

flip

On Fri, Jan 23, 2009 at 5:49 PM, Raghu Angadi wrote:
> Mark Kerzner wrote:
> >
> > But it would seem then that making a balanced directory tree would not
> > help either - because there would be another binary search, correct? I
> > assume, either way it would be as fast as can be :)
>
> But the cost of memory copies would be much less with a tree (when you add
> and delete files).
>
> Raghu.
>
> > On Fri, Jan 23, 2009 at 5:08 PM, Raghu Angadi wrote:
> > >
> > > If you are adding and deleting files in the directory, you might
> > > notice CPU penalty (for many loads, higher CPU on NN is not an issue).
> > > This is mainly because HDFS does a binary search on files in a
> > > directory each time it inserts a new file.
> > >
> > > If the directory is relatively idle, then there is no penalty.
> > >
> > > Raghu.
> > >
> > > Mark Kerzner wrote:
> > > > Hi,
> > > >
> > > > there is a performance penalty in Windows (pardon the expression) if
> > > > you put too many files in the same directory. The OS becomes very
> > > > slow, stops seeing them, and lies about their status to my Java
> > > > requests. I do not know if this is also a problem in Linux, but in
> > > > HDFS - do I need to balance a directory tree if I want to store
> > > > millions of files, or can I put them all in the same directory?
> > > >
> > > > Thank you,
> > > > Mark

--
http://www.infochimps.org
Connected Open Free Data
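The one-document-per-line flat file flip describes could be sketched like this in Python (the field names and base64-plus-JSON encoding are invented for illustration; the thread doesn't specify the serialization):

```python
import base64
import json

def to_line(path, body, **metadata):
    """Serialize one document plus its metadata as a single flat-file
    line. The body is base64ed so embedded newlines and tabs can't break
    the line-per-record format. Field names here are illustrative."""
    record = dict(metadata,
                  path=path,
                  body=base64.b64encode(body.encode("utf-8")).decode("ascii"))
    return json.dumps(record, sort_keys=True)

def from_line(line):
    """Invert to_line: recover the path, body, and remaining metadata."""
    record = json.loads(line)
    body = base64.b64decode(record.pop("body")).decode("utf-8")
    return record.pop("path"), body, record

# a million little files become a million lines in one rsync-able flat file:
line = to_line("scraped/page1.html", "<html>\n...\n</html>", fetched="20090123")
path, body, meta = from_line(line)
```

Since each record is one line, the flat file is splittable by any line-oriented tool, streams straight into MapReduce, and replicates with rsync or a thumb drive, exactly the advantages mentioned above.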
Distributed Key-Value Databases
Hey y'all,

There've been a few questions about distributed database solutions (a partial list: HBase, Voldemort, Memcached, ThruDB, CouchDB, Ringo, Scalaris, Kai, Dynomite, Cassandra, Hypertable, as well as the closed Dynamo, BigTable, SimpleDB). For someone using Hadoop at scale, what problem aspects would recommend one of those over another? And in your subjective judgement, do any of these seem especially likely to succeed?

Richard Jones of Last.fm just posted an overview with a great deal of engineering insight:
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
His focus is a production web server farm, and so in some ways orthogonal to the crowd here -- but still highly recommended.

Swaroop CH of Yahoo wrote a broad introduction to distributed DBs I also found useful:
http://www.swaroopch.com/notes/Distributed_Storage_Systems

Both give HBase short shrift, though my impression is that it is the leader among open projects for massive unordered dataset problems. The answer also, though, doesn't seem to be a simple "If you're using Hadoop you should be using HBase, dummy."

I don't have the expertise to write this kind of overview from the hadoop / big data perspective, but would eagerly read such an article from someone who does, or help summarize the insights of the list.
===

In lieu of such a summary, pointers to a few relevant threads:

* http://www.nabble.com/Why-is-scaling-HBase-much-simpler-then-scaling-a-relational-db--tt18869660.html#a19093685 (especially Jonathan Gray's breakdown)
* "HBase Performance" http://www.mail-archive.com/hadoop-u...@lucene.apache.org/msg02540.html (and the paper by Stonebraker and friends: http://www.vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf)
* http://www.nabble.com/Serving-contents-of-large-MapFiles-SequenceFiles-from-memory-across-many-machines-tt19546012.html#a19574917
* On specific problem domains:
  http://www.nabble.com/Indexed-Hashtables-tt21470024.html#a21470848
  http://www.nabble.com/Why-can%27t-Hadoop-be-used-for-online-applications---tt19461962.html#a19471894
  http://www.nabble.com/Architecture-question.-tt21100766.html#a21100766

flip

(noted in passing: a huge proportion of the development seems to be coming out of commercial enterprises and not the academic/HPC community. I worry my ivory tower is hung up on big iron and the top500.org list, at the expense of solving the many interesting problems these unlock.)

--
http://www.infochimps.org
Connected Open Free Data
@hadoop on twitter
Hey all,

There is no @hadoop on twitter, but there should be. http://twitter.com/datamapper and http://twitter.com/rails both set good examples. I'd be glad to either help get that going, or to nod approvingly if someone on core does so.

flip
Re: General questions about Map-Reduce
On Sun, Jan 11, 2009 at 9:05 PM, tienduc_dinh wrote:
> Is there any article which describes it?

There's also Tom White's in-progress "Hadoop: The Definitive Guide":
http://my.safaribooksonline.com/9780596521974

flip

--
http://www.infochimps.org
Connected Open Free Data
KeyFieldBasedPartitioner fiddling with backslashed values
I'm having a weird issue. When I invoke my mapreduce with a secondary sort using the KeyFieldBasedPartitioner, it's altering lines containing backslashes. Or I've made some foolish conceptual error and my script is doing so, but only when there's a partitioner. Any advice welcome. I've attached the script and a bowdlerized copy of the output, since I fear the worst for the formatting on the text below.

With no partitioner, among a few million other lines, my script produces this one correctly:

=
twitter_user_profile  twitter_user_profile-018421-20081205-184526  018421  M...e  http://http:\\www.MyWebsitee.com  S, NJ  I... notice.  Eastern Time (US & Canada)  -18000  20081205-184526
=

( was called using: )

hadoop jar /home/flip/hadoop/h/contrib/streaming/hadoop-*-streaming.jar \
    -mapper /home/flip/ics/pool/social/network/twitter_friends/hadoop_parse_json.rb \
    -reducer /home/flip/ics/pool/social/network/twitter_friends/hadoop_uniq_without_timestamp.rb \
    -input rawd/keyed/_20081205'/user-keyed.tsv' \
    -output out/"parsed-$output_id"

Note that the website field contained http://http:\\www.MyWebsitee.com (this person clearly either fails at internet or wins at windows).

When I use a KeyFieldBasedPartitioner, it behaves correctly *except* on these few lines with backslashes, generating instead a single backslash followed by a tab:

=
twitter_user_profile  twitter_user_profile-018421-20081205-184526  018421  M...e  http://http:\  www.MyWebsitee.com  S, NJ  I... notice.  Eastern Time (US & Canada)  -18000  20081205-184526
=

( was called using: )

hadoop jar /home/flip/hadoop/h/contrib/streaming/hadoop-*-streaming.jar \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -jobconf map.output.key.field.separator='\t' \
    -jobconf num.key.fields.for.partition=1 \
    -jobconf stream.map.output.field.separator='\t' \
    -jobconf stream.num.map.output.key.fields=2 \
    -mapper /home/flip/ics/pool/social/network/twitter_friends/hadoop_parse_json.rb \
    -reducer /home/flip/ics/pool/social/network/twitter_friends/hadoop_uniq_without_timestamp.rb \
    -input rawd/keyed/_20081205'/user-keyed.tsv' \
    -output out/"parsed-$output_id"

When I run the script on the command line

    cat input | hadoop_parse_json.rb | sort -k1,2 | hadoop_uniq_without_timestamp.rb

everything works as I'd like. I've hunted through the JIRA and found nothing. If this sounds like a problem with hadoop I'll try to isolate a proper test case.

Thanks for any advice,
flip

===

The output of my script with no secondary sort produces, among a few million others, this line correctly:

=
twitter_user_profile  twitter_user_profile-018421-20081205-184526  018421  M...e  http://http:\\www.MyWebsitee.com  S, NJ  I... notice.  Eastern Time (US & Canada)  -18000  20081205-184526
=

When I use a KeyFieldBasedPartitioner, it reaches in and diddles lines with backslashes:

=
twitter_user_profile  twitter_user_profile-018421-20081205-184526  018421  M...e  http://http:\  www.MyWebsitee.com  S, NJ  I... notice.  Eastern Time (US & Canada)  -18000  20081205-184526
=

== Script, with partitioner ==

#!/usr/bin/env bash
input_id=$1
output_id=$2
hadoop jar /home/flip/hadoop/h/contrib/streaming/hadoop-*-streaming.jar \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -jobconf map.output.key.field.separator='\t' \
    -jobconf num.key.fields.for.partition=1 \
    -jobconf stream.map.output.field.separator='\t' \
    -jobconf stream.num.map.output.key.fields=2 \
    -mapper /home/flip/ics/pool/social/network/twitter_friends/hadoop_parse_json.rb \
    -reducer /home/flip/ics/pool/social/network/twitter_friends/hadoop_uniq_without_timestamp.rb \
    -input rawd/keyed/_20081205'/user-keyed.tsv' \
    -output out/"parsed-$output_id" \
    -file hadoop_utils.rb \
    -file twitter_flat_model.rb \
    -file twitter_autourl.rb

== Excerpt of output. Everything is correct except the url field ==

twitter_user_profile  twitter_user_profile-018441-20081205-024904  018441  G..er  http://www.l...  D...O  fun...:-)  20081205-024904
twitter_user_profile  twitter_user_profile-018441-20081205-084448  018441  S...e  Eastern Time (US & Canada)  -18000  20081205-084448
twitter_user_profile  twitter_user_profile-018421-20081205-184526  018421  M...e  http://
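To help isolate whether the mangling comes from the partitioning itself or from the separator configuration, here is roughly what the partitioner computes -- a Python sketch, not Hadoop's Java code (the hash below mimics Java's String.hashCode truncated to 32 bits; the field counts mirror the jobconf settings above):

```python
def partition(line, num_key_fields=1, separator="\t", num_partitions=10):
    """Roughly what KeyFieldBasedPartitioner does: take the first
    num.key.fields.for.partition fields of the line, hash them, and mod
    by the number of reducers. Partitioning only reads the key fields;
    it has no business rewriting the rest of the line."""
    part = separator.join(line.split(separator)[:num_key_fields])
    h = 0
    for ch in part:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # Java-style rolling hash
    return h % num_partitions

# records sharing field 1 land on the same reducer, where the secondary
# sort on field 2 can order them:
a = partition("018421\t20081205-184526\trest of line")
b = partition("018421\t20081205-084448\tanother line")
assert a == b
```

Since the partition decision depends only on the key fields, a line whose *value* fields change between runs points at the separator/field-splitting configuration rather than the partition function itself.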