I would opt for the most specific tokenization that is feasible (no stemming, as much compounding as possible). The rationale for this is that stemming and uncompounding can be added by linear transformations of the matrix at any time.
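To make that concrete, here is a minimal sketch (not Mahout code; the vocabulary, counts, and stem mapping are made up for illustration) of how stemming can be applied after the fact as a single matrix multiply: a binary mapping matrix S, with one row per stem and one column per surface term, collapses an unstemmed term-document matrix X into a stemmed one via S X. Uncompounding would work the same way, with a (possibly many-to-many) map that spreads a compound term's counts over its components.

```python
# Sketch: stemming as a linear transformation of a term-document matrix.
import numpy as np

# Hypothetical unstemmed vocabulary and a tiny term-document count matrix.
terms = ["run", "running", "runs", "compound", "compounds"]
X = np.array([
    [1, 0, 2],   # "run"
    [0, 3, 0],   # "running"
    [1, 1, 0],   # "runs"
    [0, 2, 1],   # "compound"
    [4, 0, 0],   # "compounds"
])  # shape: (num_terms, num_docs)

# S[i, j] = 1 if surface term j maps to stem i.
stems = ["run", "compound"]
stem_of = {"run": "run", "running": "run", "runs": "run",
           "compound": "compound", "compounds": "compound"}
S = np.zeros((len(stems), len(terms)))
for j, t in enumerate(terms):
    S[stems.index(stem_of[t]), j] = 1.0

# Stemmed term-document matrix, obtained without re-tokenizing anything.
X_stemmed = S @ X
print(X_stemmed)  # rows now correspond to stems
```

The point is that if you tokenize at the most specific level up front, coarser views (stemmed, uncompounded) remain cheap linear transformations of the stored matrix, whereas the reverse direction is lossy.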
The only serious issue with this is the problem of overlapping compound words.

On Tue, Nov 3, 2009 at 2:39 PM, Ken Krugler <[email protected]> wrote:

> I assume there would also be an issue of which tokenizer to use to create
> the terms from the text.
>
> And possibly issues around storing separate vectors for (at least) title
> vs. content?
>
> Anybody have input on either of these?
