[Nutch-dev] Myanmar Tokeniser

2005-05-31 Thread Keith Stribley
I am interested in adding support to Nutch for searching Myanmar language text. Myanmar (Burmese) often does not have spaces between words, so the process of segmenting into words is more difficult than just whitespace. I assume that I need to start by creating a Myanmar Tokenizer and Analyzer
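
One common starting point for scripts written without spaces is greedy longest-match segmentation against a word list. The following is a minimal sketch of that idea in plain Java, not Nutch or Lucene code: the class name `LongestMatchSegmenter`, the dictionary contents, and the single-character fallback for unknown text are all assumptions made for illustration. A real Myanmar tokenizer would more likely fall back to syllable clusters than to single characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Nutch or Lucene.
final class LongestMatchSegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength;

    LongestMatchSegmenter(Set<String> dictionary) {
        this.dictionary = dictionary;
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    // Splits text into tokens, preferring the longest dictionary word at each position.
    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Whitespace, where present, still acts as a hard boundary.
            if (Character.isWhitespace(text.charAt(i))) {
                i++;
                continue;
            }
            int limit = Math.min(text.length(), i + maxWordLength);
            int matchEnd = -1;
            // Try the longest candidate first, then shrink.
            for (int j = limit; j > i; j--) {
                if (dictionary.contains(text.substring(i, j))) {
                    matchEnd = j;
                    break;
                }
            }
            if (matchEnd < 0) {
                // Assumption: unknown text is emitted one char at a time.
                matchEnd = i + 1;
            }
            tokens.add(text.substring(i, matchEnd));
            i = matchEnd;
        }
        return tokens;
    }
}
```

The token list produced by `segment` is what a tokenizer wrapper would hand to the analyzer one term at a time; greedy matching is simple but can pick the wrong split when a long word overlaps two shorter ones.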

[Nutch-dev] Re: Myanmar Tokeniser

2005-05-31 Thread Andrzej Bialecki
Keith Stribley wrote: I am interested in adding support to Nutch for searching Myanmar language text. Myanmar (Burmese) often does not have spaces between words, so the process of segmenting into words is more difficult than just whitespace. I assume that I need to start by creating a Myanmar

[Nutch-dev] Re: Myanmar Tokeniser

2005-05-31 Thread Ken Krugler
Keith Stribley wrote: I am interested in adding support to Nutch for searching Myanmar language text. Myanmar (Burmese) often does not have spaces between words, so the process of segmenting into words is more difficult than just whitespace. I assume that I need to start by creating a Myanmar

[Nutch-dev] Possible deadlock in PDFBox parser - with a fix.

2005-05-31 Thread Andrzej Bialecki
Hi, First, the symptoms: I was doing some tests on sites with many PDFs, and the Fetcher was gradually slowing down, until it became stuck. This was repeatable. A thread dump showed all threads waiting somewhere in PDFBox code (which is used by parse-pdf). In an email exchange with the author

[Nutch-dev] [jira] Updated: (NUTCH-54) Fetcher improvements

2005-05-31 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-54?page=all ] Andrzej Bialecki updated NUTCH-54: --- Attachment: final.diff This is the final version of the Fetcher improvements, for review. The most significant change is that now ProtocolStatus and

[Nutch-dev] Hard-coding of dedupField in OpenSearchServlet

2005-05-31 Thread stack
The OpenSearchServlet has a hardcoding of 'site' as the field to use for deduping search results. I'd like to be able to dedup search results on fields other than just 'site'. For example, we have collections that may have multiple instances of a URL in the index. For such collections,

[Nutch-dev] How to exclude content other than Script Style from indexing

2005-05-31 Thread Sundaramoorthy Kannan
Hi, If I have to exclude some parts of a web page from getting indexed, how can I do it? As I understand, DOMContentUtils class of HTML parser plugin currently ignores only SCRIPT, STYLE and comment text. Can I configure it to exclude some other tags too? Thanks, Kannan
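
As an illustration of what a configurable exclusion could look like, here is a minimal sketch of a DOM text walker that skips comments plus any element whose tag name is in a caller-supplied set. It uses only the standard `org.w3c.dom` API. The class name `TagFilteringTextExtractor` and its constructor are hypothetical, and this is not the actual DOMContentUtils code or an existing Nutch configuration option.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the Nutch HTML parser plugin.
final class TagFilteringTextExtractor {
    private final Set<String> skippedTags = new HashSet<>();

    // Tag names are compared case-insensitively, e.g. "script", "STYLE", "nav".
    TagFilteringTextExtractor(Set<String> tagsToSkip) {
        for (String tag : tagsToSkip) {
            skippedTags.add(tag.toUpperCase(Locale.ROOT));
        }
    }

    // Returns the text under root, minus comments and skipped subtrees.
    String extract(Node root) {
        StringBuilder sb = new StringBuilder();
        collect(root, sb);
        return sb.toString();
    }

    private void collect(Node node, StringBuilder sb) {
        short type = node.getNodeType();
        if (type == Node.COMMENT_NODE) {
            return; // comment text is never indexed
        }
        if (type == Node.ELEMENT_NODE
                && skippedTags.contains(node.getNodeName().toUpperCase(Locale.ROOT))) {
            return; // drop the whole subtree of an excluded tag
        }
        if (type == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), sb);
        }
    }
}
```

Passing a set such as `script`, `style`, `nav` reproduces the default behaviour described above and additionally drops navigation blocks; the set itself could be read from a configuration property.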