Hey Drew,

Let me know when you post the JIRA, and whether there's any help you might want.

On Fri, Jan 8, 2010 at 5:03 PM, Drew Farris <[email protected]> wrote:

> Jake, thanks for the review, running narrative and comments. The
> Analyzer in use should be up to the user, so there will be flexibility
> to mess around with lots of alternatives there, but it will be nice to
> provide reasonable defaults and include this sort of discussion in the
> wiki page for the algo. I'll finish up the rest of the code for it and
> post a patch to JIRA.
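>
> For a reasonable default I'm picturing something along these lines
> (just a sketch, assuming Lucene's contrib ShingleAnalyzerWrapper; any
> Analyzer could be dropped in for StandardAnalyzer):
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.util.Version;
>
> public class NGramAnalyzers {
>   public static Analyzer defaultAnalyzer(int maxNGramSize) {
>     // tokenize with StandardAnalyzer, then emit shingles (ngrams) up
>     // to maxNGramSize alongside the unigrams
>     return new ShingleAnalyzerWrapper(
>         new StandardAnalyzer(Version.LUCENE_CURRENT), maxNGramSize);
>   }
> }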
>
> Robin, I'll take a look at the dictionaryVectorizer, and see how they
> can work together. I think something like SequenceFiles<documentId,
> Text or BytesWritable> makes sense as input for this job, and it's
> probably easier to work with than what I had to whip up to slurp in
> files whole.
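>
> Something like this is what I have in mind for preparing the input
> (sketch only; the path and document contents are made up):
>
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Text;
>
> public class DocumentWriter {
>   public static void main(String[] args) throws IOException {
>     Configuration conf = new Configuration();
>     FileSystem fs = FileSystem.get(conf);
>     // one record per document: key is the document id, value the text
>     SequenceFile.Writer writer = SequenceFile.createWriter(
>         fs, conf, new Path("documents.seq"), Text.class, Text.class);
>     try {
>       writer.append(new Text("doc-1"),
>           new Text("it was the best of times it was the worst of times"));
>     } finally {
>       writer.close();
>     }
>   }
> }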
>
> Does anyone know if there is a stream-based alternative to Text or
> BytesWritable?
>
> On Thu, Jan 7, 2010 at 11:46 PM, Jake Mannix <[email protected]>
> wrote:
> > Ok, I lied - I think what you described here is way *faster* than what I
> > was doing, because I wasn't starting with the original corpus, I had
> > something like google's ngram terabyte data (a massive HDFS file with
> > just "ngram ngram-frequency" on each line), which mean I had to do
> > a multi-way join (which is where I needed to do a secondary sort by
> > value).
> >
> > Starting with the corpus itself (the case we're talking about) you have
> > some nice tricks in here:
> >
> > On Thu, Jan 7, 2010 at 6:46 PM, Drew Farris <[email protected]>
> wrote:
> >>
> >>
> >> The output of that map task is something like:
> >>
> >> k:(n-1)gram v:ngram
> >>
> >
> > This is great right here - it helps you kill two birds with one stone:
> > the join and the wordcount phases.
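> >
> > Roughly, for the bigram case I'd expect the map to look something
> > like this (a sketch with the old mapred API; a whitespace split
> > stands in for the Lucene analyzer just to keep it self-contained):
> >
> > import java.io.IOException;
> > import org.apache.hadoop.io.Text;
> > import org.apache.hadoop.mapred.MapReduceBase;
> > import org.apache.hadoop.mapred.Mapper;
> > import org.apache.hadoop.mapred.OutputCollector;
> > import org.apache.hadoop.mapred.Reporter;
> >
> > public class NGramMapper extends MapReduceBase
> >     implements Mapper<Text, Text, Text, Text> {
> >
> >   public void map(Text docId, Text doc, OutputCollector<Text, Text> out,
> >       Reporter reporter) throws IOException {
> >     String[] tokens = doc.toString().toLowerCase().split("\\s+");
> >     for (int i = 0; i + 1 < tokens.length; i++) {
> >       String ngram = tokens[i] + ' ' + tokens[i + 1];
> >       // emit each constituent (n-1)gram as the key, the ngram as value
> >       out.collect(new Text(tokens[i]), new Text(ngram));
> >       out.collect(new Text(tokens[i + 1]), new Text(ngram));
> >     }
> >   }
> > }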
> >
> >
> >> k:ngram,ngram-frequency v:(n-1)gram,(n-1)gram-frequency
> >>
> >> e.g.:
> >> k:the best,1 v:best,2
> >> k:best of,1 v:best,2
> >> k:best of,1 v:of,2
> >> k:of times,1 v:of,2
> >> k:the best,1 v:the,1
> >> k:of times,1 v:times,1
> >>
> >
> > Yeah, once you're here, you're home free.  This should really be a rather
> > quick set of jobs, even on really big data, and even dealing with it as
> > text.
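> >
> > The reduce after that could be shaped something like this (sketch
> > only; recovering the (n-1)gram count by counting its appearances as
> > the leading word is a crude approximation, and the real patch may
> > well do it differently):
> >
> > import java.io.IOException;
> > import java.util.HashMap;
> > import java.util.Iterator;
> > import java.util.Map;
> > import org.apache.hadoop.io.Text;
> > import org.apache.hadoop.mapred.MapReduceBase;
> > import org.apache.hadoop.mapred.OutputCollector;
> > import org.apache.hadoop.mapred.Reducer;
> > import org.apache.hadoop.mapred.Reporter;
> >
> > public class NGramJoinReducer extends MapReduceBase
> >     implements Reducer<Text, Text, Text, Text> {
> >
> >   public void reduce(Text subgram, Iterator<Text> values,
> >       OutputCollector<Text, Text> out, Reporter reporter)
> >       throws IOException {
> >     Map<String, Integer> ngramFreq = new HashMap<String, Integer>();
> >     int subgramFreq = 0;
> >     String head = subgram.toString() + ' ';
> >     while (values.hasNext()) {
> >       String ngram = values.next().toString();
> >       Integer seen = ngramFreq.get(ngram);
> >       // each ngram shows up once per corpus occurrence, so the group
> >       // hands us the ngram frequency for free: join + wordcount at once
> >       ngramFreq.put(ngram, seen == null ? 1 : seen + 1);
> >       // line-final occurrences of the key are missed here, but that
> >       // error is fractional
> >       if (ngram.startsWith(head)) {
> >         subgramFreq++;
> >       }
> >     }
> >     for (Map.Entry<String, Integer> e : ngramFreq.entrySet()) {
> >       out.collect(new Text(e.getKey() + "," + e.getValue()),
> >           new Text(subgram.toString() + "," + subgramFreq));
> >     }
> >   }
> > }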
> >
> >
> >> I'm also wondering about the best way to handle input. Line by line
> >> processing would miss ngrams spanning lines, but full document
> >> processing with the StandardAnalyzer+ShingleFilter will form ngrams
> >> across sentence boundaries.
> >>
> >
> > These effects are just minor issues: you lose a little bit of signal on
> > line endings, and you pick up some noise catching ngrams across
> > sentence boundaries, but it's fractional compared to your whole set.
> > Don't try to be too fancy and cram tons of lines together.  If your
> > data comes in different chunks than just one huge HDFS text file, you
> > could certainly chunk it into bigger chunks (10, 100, 1000 lines, maybe)
> > to reduce the newline error if necessary, but it's probably not needed.
> > The sentence boundary part gets washed out in the LLR step anyways
> > (because they'll almost always turn out to have a low LLR score).
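> >
> > (For concreteness, the LLR here is Dunning's standard 2x2
> > contingency-table form; a self-contained sketch with made-up counts:)
> >
> > public class Llr {
> >   // k11: ngram count; k12: head-without-tail; k21: tail-without-head;
> >   // k22: everything else. Rare accidental pairings score low, which
> >   // is why the sentence-boundary ngrams mostly wash out.
> >   public static double logLikelihoodRatio(long k11, long k12,
> >       long k21, long k22) {
> >     long n = k11 + k12 + k21 + k22;
> >     double llr = 2.0 * (xLogX(n)
> >         + xLogX(k11) + xLogX(k12) + xLogX(k21) + xLogX(k22)
> >         - xLogX(k11 + k12) - xLogX(k21 + k22)
> >         - xLogX(k11 + k21) - xLogX(k12 + k22));
> >     return Math.max(0.0, llr);
> >   }
> >
> >   private static double xLogX(long x) {
> >     return x == 0 ? 0.0 : x * Math.log(x);
> >   }
> >
> >   public static void main(String[] args) {
> >     // "of times" seen 2 times, "of" 3 times, "times" 2 times, in a
> >     // corpus of 1000 bigrams (made-up numbers)
> >     System.out.println(logLikelihoodRatio(2, 1, 0, 997));
> >   }
> > }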
> >
> > What I've found I've had to do sometimes is something with stop words.
> > If you don't use stop words at all, you end up getting a lot of
> > relatively high LLR scoring ngrams like "up into" and "he would", and
> > in general pairings of a relatively rare unigram with a pronoun or
> > preposition.  Maybe there are other ways of avoiding that, but I've
> > found that you do need to take some care with the stop words (removing
> > them altogether leads to some weird looking ngrams if you want to
> > display them somewhere).
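> >
> > Concretely, the decision is mostly about where (or whether) a stop
> > set goes into the analyzer chain; a sketch of the two options,
> > assuming the Lucene classes named here:
> >
> > import java.util.Collections;
> > import org.apache.lucene.analysis.Analyzer;
> > import org.apache.lucene.analysis.StopAnalyzer;
> > import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
> > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > import org.apache.lucene.util.Version;
> >
> > public class StopWordChoices {
> >   // strip stop words before shingling: kills the "up into" pairings
> >   // but produces the weird-looking joined ngrams if displayed
> >   public static Analyzer stopped() {
> >     return new ShingleAnalyzerWrapper(
> >         new StandardAnalyzer(Version.LUCENE_CURRENT,
> >             StopAnalyzer.ENGLISH_STOP_WORDS_SET), 2);
> >   }
> >
> >   // keep everything and rely on LLR plus some post-filtering to
> >   // knock out the rare-unigram + pronoun/preposition pairings
> >   public static Analyzer unstopped() {
> >     return new ShingleAnalyzerWrapper(
> >         new StandardAnalyzer(Version.LUCENE_CURRENT,
> >             Collections.emptySet()), 2);
> >   }
> > }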
> >
> >
> >> I'm interested in whether there's a more efficient way to structure
> >> the M/R passes. It feels a little funny to no-op a whole map cycle. It
> >> would almost be better if one could chain two reduces together.
> >>
> >
> > Beware premature optimization - try this on a nice big monster set on
> > a real cluster, and see how long it takes.  I have a feeling you'll be
> > pleasantly surprised.  But even before that - show us a patch; maybe
> > someone will spot some easy, low-hanging optimization tricks.
> >
> >  -jake
> >
>



-- 
Zaki Rahaman
