On Fri, Nov 13, 2009 at 10:35 AM, Ken Krugler <[email protected]> wrote:
> Hi Ted,
>
> On Nov 3, 2009, at 6:37pm, Ted Dunning wrote:
>
>> I would opt for the most specific tokenization that is feasible (no
>> stemming, as much compounding as possible).
>
> By "as much compounding as possible", do you mean you want the tokenizer
> to do as much splitting as possible, or as little?

My ultimate preference is to actually glue very common phrases into a
single term. This can be reversed with a linear transformation. It may not
be feasible for a first hack.

> E.g. "super-duper" should be left as-is, or turned into "super" and
> "duper"?

Left as is. And "New York" should be tokenized as a single term. Likewise
with "Staff writer of the Wall Street Journal".

> Is there a particular configuration of Lucene tokenizers that you'd
> suggest?

I am not an expert, but I know of no tokenizers that will do this. Lucene
typically retains positional information, which means that searching for
phrases is relatively cheap (roughly 10x in search time, but most
collections take approximately zero time to search anyway).

Maybe the best answer is to produce two vectorized versions: one heavily
stemmed and split apart (the micro-token approach), and another, the
mega-token version, that we can progressively improve (rough sketch at the
bottom of this mail). The final arbiter should be whoever does the work
(you!).
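For the mega-token side, a ShingleFilter chain is only a crude
approximation of what I have in mind: it glues every pair or triple of
adjacent words into a single term rather than just the very common
phrases, but it is a starting point. Untested sketch, written against a
recent Lucene release (exact package names and constructors differ across
versions):

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class MegaTokenSketch {
  public static void main(String[] args) throws Exception {
    // Basic word splitting, lower-cased, then shingled into 2- and
    // 3-word terms (ShingleFilter also passes the unigrams through).
    Tokenizer source = new StandardTokenizer();
    source.setReader(new StringReader("Staff writer of the Wall Street Journal"));
    TokenStream stream = new LowerCaseFilter(source);
    stream = new ShingleFilter(stream, 2, 3);

    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      // Prints "staff", "staff writer", ... "wall street journal", etc.
      System.out.println(term);
    }
    stream.end();
    stream.close();
  }
}

The micro-token version would be the same chain with a PorterStemFilter in
place of the ShingleFilter.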
