Re: best way to copy all files from a file system to hdfs

2009-02-01 Thread Philip (flip) Kromer
Could you tar.bz2 them up (setting up the tar so that it makes a few dozen
archives), toss them onto HDFS, and use
http://stuartsierra.com/2008/04/24/a-million-little-files
to convert them into SequenceFiles?

This lets you preserve the originals and do the SequenceFile conversion
across the cluster. It's only really helpful, of course, if you also want to
prepare a .tar.bz2 so you can clear out the sprawl.
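
(For reference, the non-parallel core of what Mark describes -- append each
file into a SequenceFile with the filename as a Text key and the raw bytes as
a BLOCK-compressed BytesWritable value -- is roughly the sketch below.
Untested; non-recursive for brevity; class name and paths are made up.)

  // Rough sketch only: bundle the files in one local directory into a
  // single SequenceFile on HDFS.
  import java.io.File;
  import java.io.FileInputStream;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.IOUtils;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class BundleDirIntoSequenceFile {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path out = new Path(args[1]);                       // e.g. bundles/docs.seq
      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, out, Text.class, BytesWritable.class,
          SequenceFile.CompressionType.BLOCK);
      try {
        for (File f : new File(args[0]).listFiles()) {    // local source dir
          if (!f.isFile()) continue;
          byte[] bytes = new byte[(int) f.length()];
          FileInputStream in = new FileInputStream(f);
          try { IOUtils.readFully(in, bytes, 0, bytes.length); }
          finally { in.close(); }
          writer.append(new Text(f.getPath()), new BytesWritable(bytes));
        }
      } finally {
        writer.close();
      }
    }
  }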

flip

On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner  wrote:

> Hi,
>
> I am writing an application to copy all files from a regular PC to a
> SequenceFile. I can surely do this by simply recursing all directories on
> my PC, but I wonder if there is any way to parallelize this, a MapReduce
> task even. Tom White's book seems to imply that it will have to be a custom
> application.
>
> Thank you,
> Mark
>



-- 
http://www.infochimps.org
Connected Open Free Data


Re: HDFS - millions of files in one directory?

2009-01-27 Thread Philip (flip) Kromer
Tossing one more on this king of all threads:
Stuart Sierra of AltLaw wrote a nice little tool to serialize tar.bz2 files
into a SequenceFile, with the filename as key and the file contents as a
BLOCK-compressed blob.
  http://stuartsierra.com/2008/04/24/a-million-little-files
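
(Reading such a bundle back out is just iterating the (filename, contents)
pairs with a SequenceFile.Reader -- a rough, untested sketch; the path
argument and class name are illustrative:)

  // Rough sketch: list the filenames and content sizes in a bundle.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class DumpBundle {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      SequenceFile.Reader reader =
          new SequenceFile.Reader(fs, new Path(args[0]), conf);
      Text filename = new Text();
      BytesWritable contents = new BytesWritable();
      try {
        while (reader.next(filename, contents)) {
          // getBytes() is padded; getLength() is the real content size
          System.out.println(filename + "\t" + contents.getLength() + " bytes");
        }
      } finally {
        reader.close();
      }
    }
  }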

flip


On Mon, Jan 26, 2009 at 3:20 PM, Mark Kerzner  wrote:

> Jason, this is awesome, thank you.
> By the way, is there a book or manual with "best practices?"
>
> On Mon, Jan 26, 2009 at 3:13 PM, jason hadoop wrote:
>
> > Sequence files rock, and you can use the
> > bin/hadoop dfs -text FILENAME command line tool to get a toString-level
> > unpacking of the sequence file key/value pairs.
> >
> > If you provide your own key or value classes, you will need to implement
> > a toString method to get some use out of this. Also, your class path
> > will need to include the jars with your custom key/value classes.
> >
> > HADOOP_CLASSPATH="myjar1:myjar2..." bin/hadoop dfs -text FILENAME
> >
> >
> > On Mon, Jan 26, 2009 at 1:08 PM, Mark Kerzner 
> > wrote:
> >
> > > Thank you, Doug, then all is clear in my head.
> > > Mark
> > >
> > > On Mon, Jan 26, 2009 at 3:05 PM, Doug Cutting 
> > wrote:
> > >
> > > > Mark Kerzner wrote:
> > > >
> > > >> Okay, I am convinced. I only noticed that Doug, the originator, was
> > > >> not happy about it - but in open source one has to give up control
> > > >> sometimes.
> > > >>
> > > >
> > > > I think perhaps you misunderstood my remarks.  My point was that, if
> > > > you looked to Nutch's Content class for an example, it is, for
> > > > historical reasons, somewhat more complicated than it needs to be and
> > > > is thus a less than perfect example.  But using SequenceFile to store
> > > > web content is certainly a best practice and I did not mean to imply
> > > > otherwise.
> > > >
> > > > Doug
> > > >
> > >
> >
>



-- 
http://www.infochimps.org
Connected Open Free Data


Re: HDFS - millions of files in one directory?

2009-01-24 Thread Philip (flip) Kromer
I think that Google developed
BigTable<http://en.wikipedia.org/wiki/BigTable> to
solve this; hadoop's HBase, or any of the myriad other distributed/document
databases should work depending on need:
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
  http://www.mail-archive.com/core-user@hadoop.apache.org/msg07011.html

Heritrix <http://en.wikipedia.org/wiki/Heritrix>,
Nutch<http://en.wikipedia.org/wiki/Nutch>,
others use the ARC file format
  http://www.archive.org/web/researcher/ArcFileFormat.php
  http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
These of course are industrial strength tools (and many of their authors are
here in the room with us :) The only question with those tools is whether
their might exceeds your needs.

There's some oddball project out there that does peer-to-peer something
something scraping but I can't find it anywhere in my bookmarks. I don't
recall whether they're file-backed or DB-backed.

If you, like us, want something more modest and targeted, there is the
recently released Python toolkit
  http://lucasmanual.com/mywiki/DataHub
I haven't looked at it to see if they've used it at scale.

We infochimps are working right now to clean up and organize for initial
release our own Infinite Monkeywrench, a homely but effective toolkit for
gathering and munging datasets.  (Those stupid little one-off scripts you
write and throw away? A Ruby toolkit to make them slightly less annoying.)
We frequently use it for directed scraping of APIs and websites.  If you're
willing to deal with pre-release code that's never strayed far from the
machines of the guys what wrote it, I can point you to what we have.

I think I was probably too tough on bundling into files. If things are
immutable, and only treated in bulk, and are easily and reversibly
serialized, bundling many documents into a file is probably good. As I said,
our toolkit uses flat text files, with the advantages of simplicity and the
downside of ad hoc-ness. Storing into the ARC format lets you use the tools
in the Big Scraper ecosystem, but obvs. you'd need to convert out to use
with other things, possibly returning you to this same question.

If you need to grab arbitrary subsets of the data, or one set of locality
tradeoffs suits you better than the other, or you need better metadata
management than bundling-into-files gives you, then I think that's why those
distributed/document-type databases got invented.

flip

On Sat, Jan 24, 2009 at 7:21 PM, Mark Kerzner  wrote:

> Philip,
>
> it seems like you went through the same problems as I did, and confirmed my
> feeling that this is not a trivial problem. My first idea was to balance the
> directory tree somehow and to store the remaining metadata elsewhere, but as
> you say, it has limitations. I could use some solution like your specific
> one, but I am only surprised that this problem does not have a well-known
> solution, or solutions. Again, how do Google or Yahoo store the files that
> they have crawled? The MapReduce paper says that they store them all first,
> that is a few billion pages. How do they do it?
>
> Raghu,
>
> if I write all files only once, is the cost the same in one directory or do
> I need to find the optimal directory size and, when full, start another
> "bucket?"
>
> Thank you,
> Mark
>
> On Fri, Jan 23, 2009 at 11:01 PM, Philip (flip) Kromer wrote:
>
> > I ran into this problem, hard, and I can vouch that this is not a
> > windows-only problem. ReiserFS, ext3 and OSX's HFS+ become cripplingly
> > slow with more than a few hundred thousand files in the same directory.
> > (The operation to correct this mistake took a week to run.)  That is one
> > of several hard lessons I learned about "don't write your scraper to
> > replicate the path structure of each document as a file on disk."
> >
> > Cascading the directory structure works, but sucks in various other ways,
> > and itself stops scaling after a while.  What I eventually realized is
> > that I was using the filesystem as a particularly wrongheaded document
> > database, and that the metadata delivery of a filesystem just doesn't
> > work for this.
> >
> > Since in our application the files are text and are immutable, our ad hoc
> > solution is to encode and serialize each file with all its metadata, one
> > per line, into a flat file.
> >
> > A distributed database is probably the correct answer, but this is
> > working quite well for now and even has some advantages. (No-cost
> > replication from work to home or offline by rsync or thumb drive, for
> > example.)
> >
> > flip

Re: HDFS - millions of files in one directory?

2009-01-23 Thread Philip (flip) Kromer
I ran into this problem, hard, and I can vouch that this is not a windows-only
problem. ReiserFS, ext3 and OSX's HFS+ become cripplingly slow with more
than a few hundred thousand files in the same directory. (The operation to
correct this mistake took a week to run.)  That is one of several hard
lessons I learned about "don't write your scraper to replicate the path
structure of each document as a file on disk."

Cascading the directory structure works, but sucks in various other ways,
and itself stops scaling after a while.  What I eventually realized is that
I was using the filesystem as a particularly wrongheaded document database,
and that the metadata delivery of a filesystem just doesn't work for this.

Since in our application the files are text and are immutable, our ad hoc
solution is to encode and serialize each file with all its metadata, one per
line, into a flat file.
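
(The encoding is nothing fancy -- roughly the shape below, though the field
layout and escaping scheme here are illustrative rather than our exact
format:)

  // Rough sketch of one-document-per-line: tab-separate the metadata fields
  // and escape backslashes, tabs and newlines in the body so that each
  // record stays on a single line.
  public class FlatFileRecord {
    static String escape(String s) {
      return s.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n");
    }

    static String toLine(String url, String fetchedAt, String body) {
      return escape(url) + "\t" + escape(fetchedAt) + "\t" + escape(body);
    }

    public static void main(String[] args) {
      System.out.println(toLine("http://example.com/", "20090123-110100",
          "line one\nline two\twith a tab"));
    }
  }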

A distributed database is probably the correct answer, but this is working
quite well for now and even has some advantages. (No-cost replication from
work to home or offline by rsync or thumb drive, for example.)

flip

On Fri, Jan 23, 2009 at 5:49 PM, Raghu Angadi  wrote:

> Mark Kerzner wrote:
>
>> But it would seem then that making a balanced directory tree would not
>> help either - because there would be another binary search, correct? I
>> assume, either way it would be as fast as can be :)
>>
>
> But the cost of memory copies would be much less with a tree (when you add
> and delete files).
>
> Raghu.
>
>
>
>>
>> On Fri, Jan 23, 2009 at 5:08 PM, Raghu Angadi wrote:
>>
>>> If you are adding and deleting files in the directory, you might notice a
>>> CPU penalty (for many loads, higher CPU on the NN is not an issue). This
>>> is mainly because HDFS does a binary search on the files in a directory
>>> each time it inserts a new file.
>>>
>>> If the directory is relatively idle, then there is no penalty.
>>>
>>> Raghu.
>>>
>>>
>>> Mark Kerzner wrote:
>>>
>>>  Hi,
>>>
>>>  there is a performance penalty in Windows (pardon the expression) if you
>>>  put too many files in the same directory. The OS becomes very slow,
>>>  stops seeing them, and lies about their status to my Java requests. I do
>>>  not know if this is also a problem in Linux, but in HDFS - do I need to
>>>  balance a directory tree if I want to store millions of files, or can I
>>>  put them all in the same directory?
>>>
>>>  Thank you,
>>>  Mark
>>>


>>
>


-- 
http://www.infochimps.org
Connected Open Free Data


Distributed Key-Value Databases

2009-01-19 Thread Philip (flip) Kromer
Hey y'all,

There've been a few questions about distributed database solutions (a
partial list: HBase, Voldemort, Memcached, ThruDB, CouchDB, Ringo, Scalaris,
Kai, Dynomite, Cassandra, Hypertable, as well as the closed Dynamo,
BigTable, SimpleDB).

For someone using Hadoop at scale, what problem aspects would recommend one
of those over another?
And in your subjective judgement, do any of these seem especially likely to
succeed?

Richard Jones of Last.fm just posted an overview with a great deal of
engineering insight:

http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
His focus is a production web server farm, and so in some ways orthogonal to
the crowd here -- but still highly recommended.  Swaroop CH of Yahoo wrote a
broad introduction to distributed DBs I also found useful:
  http://www.swaroopch.com/notes/Distributed_Storage_Systems

Both give HBase short shrift, though my impression is that it is the leader
among open projects for massive unordered dataset problems. The answer also,
though, doesn't seem to be a simple "If you're using Hadoop you should be
using HBase, dummy."

I don't have the expertise to write this kind of overview from the hadoop /
big data perspective, but would eagerly read such an article from someone who
does, or a summary of the list's insights.

===

Until such a summary exists, pointers to a few relevant threads:
*
http://www.nabble.com/Why-is-scaling-HBase-much-simpler-then-scaling-a-relational-db--tt18869660.html#a19093685

  (especially Jonathan Gray's breakdown)
* "HBase Performance"
http://www.mail-archive.com/hadoop-u...@lucene.apache.org/msg02540.html
  (and the paper by Stonebraker and friends:
http://www.vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf)
*
http://www.nabble.com/Serving-contents-of-large-MapFiles-SequenceFiles-from-memory-across-many-machines-tt19546012.html#a19574917
* On specific problem domains:
  http://www.nabble.com/Indexed-Hashtables-tt21470024.html#a21470848

http://www.nabble.com/Why-can%27t-Hadoop-be-used-for-online-applications---tt19461962.html#a19471894
  http://www.nabble.com/Architecture-question.-tt21100766.html#a21100766

flip

(noted in passing: a huge proportion of the development seems to be coming
out of commercial enterprises and not the academic/HPC community. I worry my
ivory tower is hung up on big iron and the top500.org list, at the expense
of solving the many interesting problems these unlock.)
-- 
http://www.infochimps.org
Connected Open Free Data


@hadoop on twitter

2009-01-13 Thread Philip (flip) Kromer
Hey all,
There is no @hadoop on twitter, but there should be.
http://twitter.com/datamapper and http://twitter.com/rails both set good
examples.

I'd be glad to either help get that going or to nod approvingly if someone
on core does so.

flip


Re: General questions about Map-Reduce

2009-01-12 Thread Philip (flip) Kromer
On Sun, Jan 11, 2009 at 9:05 PM, tienduc_dinh wrote:

> Is there any article which describes it?
>

There's also Tom White's in-progress "Hadoop: The Definitive Guide":
http://my.safaribooksonline.com/9780596521974

flip
-- 
http://www.infochimps.org
Connected Open Free Data


KeyFieldBasedPartitioner fiddling with backslashed values

2008-12-15 Thread Philip (flip) Kromer
I'm having a weird issue.

When I invoke my mapreduce with a secondary sort using
the KeyFieldBasedPartitioner, it's altering lines containing backslashes.
 Or I've made some foolish conceptual error and my script is doing so, but
only when there's a partitioner.  Any advice welcome.  I've attached the
script and a bowdlerized copy of the output, since I fear the worst for the
formatting on the text below.

With no partitioner, among a few million other lines, my script
produces this one correctly:

=
twitter_user_profile twitter_user_profile-018421-20081205-184526
018421 M...e http://http:\\www.MyWebsitee.com S, NJ I... notice. Eastern
Time (US & Canada) -18000 20081205-184526
=


( was called using: )


hadoop jar /home/flip/hadoop/h/contrib/streaming/hadoop-*-streaming.jar \
  -mapper  /home/flip/ics/pool/social/network/twitter_friends/hadoop_parse_json.rb \
  -reducer /home/flip/ics/pool/social/network/twitter_friends/hadoop_uniq_without_timestamp.rb \
  -input   rawd/keyed/_20081205'/user-keyed.tsv' \
  -output  out/"parsed-$output_id"


Note that the website field contained
  http://http:\\www.MyWebsitee.com
(this person clearly either fails at internet or wins at windows)

When I use a KeyFieldBasedPartitioner, it behaves correctly *except* on
these few lines with backslashes, generating instead a single backslash
followed by a tab:


=
twitter_user_profile twitter_user_profile-018421-20081205-184526
018421 M...e http://http:\ www.MyWebsitee.com S, NJ I... notice. Eastern
Time (US & Canada) -18000 20081205-184526
=


( was called using: )

hadoop jar /home/flip/hadoop/h/contrib/streaming/hadoop-*-streaming.jar \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -jobconf map.output.key.field.separator='\t' \
  -jobconf num.key.fields.for.partition=1 \
  -jobconf stream.map.output.field.separator='\t' \
  -jobconf stream.num.map.output.key.fields=2 \
  -mapper  /home/flip/ics/pool/social/network/twitter_friends/hadoop_parse_json.rb \
  -reducer /home/flip/ics/pool/social/network/twitter_friends/hadoop_uniq_without_timestamp.rb \
  -input   rawd/keyed/_20081205'/user-keyed.tsv' \
  -output  out/"parsed-$output_id"
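
(For clarity, my reading of what those settings ask for: streaming treats the
first two tab-separated fields of the map output as the key, and the
partitioner assigns a reducer from the first field alone -- in effect
something like the sketch below, which is only an illustration of the intent,
not the actual partitioner code.)

  // Illustrative only: the partitioning behavior I expect from
  // num.key.fields.for.partition=1 over tab-separated map output.
  public class IntendedPartitioning {
    static int partitionFor(String mapOutputLine, int numReduceTasks) {
      String firstField = mapOutputLine.split("\t", -1)[0];
      return (firstField.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
  }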


When I run the script on the command line
  cat input | hadoop_parse_json.rb | sort -k1,2 | hadoop_uniq_without_timestamp.rb
everything works as I'd like.

I've hunted through the JIRA and found nothing.
If this sounds like a problem with hadoop I'll try to isolate a proper test
case.

Thanks for any advice,
flip

With no secondary sort, my script produces this line correctly (among a few
million others):

=
twitter_user_profiletwitter_user_profile-018421-20081205-184526 
018421  M...e   http://http:\\www.MyWebsitee.comS, NJ   I... 
notice.Eastern Time (US & Canada)  -18000  20081205-184526
=

When I use a KeyFieldBasedPartitioner, it reaches in and diddles lines with 
backslashes:

=
twitter_user_profiletwitter_user_profile-018421-20081205-184526 
018421  M...e   http://http:\   www.MyWebsitee.com  S, NJ   I... 
notice.Eastern Time (US & Canada)  -18000  20081205-184526
=

===
==
== Script, with partitioner
==

#!/usr/bin/env bash
input_id=$1
output_id=$2
hadoop jar /home/flip/hadoop/h/contrib/streaming/hadoop-*-streaming.jar \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -jobconf map.output.key.field.separator='\t' \
  -jobconf num.key.fields.for.partition=1 \
  -jobconf stream.map.output.field.separator='\t' \
  -jobconf stream.num.map.output.key.fields=2 \
  -mapper  /home/flip/ics/pool/social/network/twitter_friends/hadoop_parse_json.rb \
  -reducer /home/flip/ics/pool/social/network/twitter_friends/hadoop_uniq_without_timestamp.rb \
  -input   rawd/keyed/_20081205'/user-keyed.tsv' \
  -output  out/"parsed-$output_id" \
  -file    hadoop_utils.rb \
  -file    twitter_flat_model.rb \
  -file    twitter_autourl.rb

== Excerpt of output.  Everything is correct except the url field

twitter_user_profiletwitter_user_profile-018441-20081205-024904 
018441  G..er   http://www.l... D...O fun...:-) 
20081205-024904
twitter_user_profiletwitter_user_profile-018441-20081205-084448 
018441  S...e   Eastern Time (US & Canada)  
-18000  20081205-084448 
twitter_user_profiletwitter_user_profile-018421-20081205-184526 
018421  M...e   http://