I am interested in adding support to Nutch for searching Myanmar
language text. Myanmar (Burmese) often does not have spaces between
words, so the process of segmenting into words is more difficult than
just whitespace.
I assume that I need to start by creating a Myanmar Tokenizer and
Analyzer
Keith Stribley wrote:
I am interested in adding support to Nutch for searching Myanmar
language text. Myanmar (Burmese) often does not have spaces between
words, so the process of segmenting into words is more difficult than
just whitespace.
I assume that I need to start by creating a Myanmar
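One common starting point for text without spaces between words is dictionary-based longest matching. The following is only a minimal sketch of that idea, not a Nutch or Lucene Tokenizer: the class name LongestMatchSegmenter and the tiny word list are hypothetical, and Latin placeholder words stand in for a real Myanmar dictionary. It works on UTF-16 code units, so a real version would also need to keep combining marks attached to their base characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Nutch or Lucene.
class LongestMatchSegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength;

    LongestMatchSegmenter(Set<String> dictionary) {
        this.dictionary = new HashSet<>(dictionary);
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    // Greedy longest match: at each position take the longest dictionary
    // word that starts there; fall back to a single character if none does.
    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            if (Character.isWhitespace(text.charAt(pos))) {
                pos++; // whitespace still separates tokens when present
                continue;
            }
            int limit = Math.min(text.length(), pos + maxWordLength);
            int matchEnd = -1;
            for (int end = limit; end > pos; end--) {
                if (dictionary.contains(text.substring(pos, end))) {
                    matchEnd = end;
                    break;
                }
            }
            if (matchEnd < 0) {
                matchEnd = pos + 1; // unknown text: emit one character
            }
            tokens.add(text.substring(pos, matchEnd));
            pos = matchEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Placeholder dictionary; a real one would hold Myanmar words.
        Set<String> words = new HashSet<>(Arrays.asList("the", "cat", "sat"));
        System.out.println(new LongestMatchSegmenter(words).segment("thecatsat"));
    }
}
```

Greedy matching is simple but can pick the wrong split when words overlap, so it is best treated as a baseline to compare other approaches against.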
Hi,
First, the symptoms: I was doing some tests on sites with many PDFs, and
the Fetcher was gradually slowing down, until it became stuck. This was
repeatable. A thread dump showed all threads waiting somewhere in PDFBox
code (which is used by parse-pdf). In an email exchange with the author
[ http://issues.apache.org/jira/browse/NUTCH-54?page=all ]
Andrzej Bialecki updated NUTCH-54:
---
Attachment: final.diff
This is the final version of the Fetcher improvements, for review. The most
significant change is that now ProtocolStatus and
The OpenSearchServlet hardcodes 'site' as the field to use when
deduping search results. I'd like to be able to dedup search results on
fields other than just 'site'.
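To make the request concrete, here is a rough sketch of deduping on an arbitrary field. It does not use the Nutch search API: the FieldDeduper class, the representation of a hit as a Map of field names to values, and the maxPerValue parameter are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not the OpenSearchServlet code.
class FieldDeduper {
    // Keeps at most maxPerValue hits for each distinct value of the given
    // field, preserving the original ranking order. Hits that lack the
    // field are always kept.
    static List<Map<String, String>> dedup(List<Map<String, String>> hits,
                                           String field, int maxPerValue) {
        Map<String, Integer> seen = new HashMap<>();
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> hit : hits) {
            String value = hit.get(field);
            if (value == null) {
                kept.add(hit);
                continue;
            }
            int count = seen.merge(value, 1, Integer::sum);
            if (count <= maxPerValue) {
                kept.add(hit);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Map<String, String>> hits = new ArrayList<>();
        Map<String, String> first = new HashMap<>();
        first.put("site", "a.example");
        first.put("url", "http://a.example/page");
        Map<String, String> second = new HashMap<>();
        second.put("site", "b.example");
        second.put("url", "http://a.example/page");
        hits.add(first);
        hits.add(second);
        // Deduping on "url" instead of "site" drops the second hit.
        System.out.println(dedup(hits, "url", 1).size());
    }
}
```

With the field name passed in as a parameter, the same code covers both the existing 'site' behaviour and deduping on something like the url.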
For example, we have collections that may have multiple instances of a
URL in the index. For such collections,
Hi,
If I have to exclude some parts of a web page from being indexed, how
can I do it? As I understand it, the DOMContentUtils class of the HTML
parser plugin currently ignores only SCRIPT, STYLE and comment text. Can I
configure it to exclude some other tags too?
Thanks,
Kannan
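For illustration, the kind of behaviour being asked about could look roughly like the sketch below. This is not the DOMContentUtils code: ExcludingTextExtractor is a hypothetical class that walks a standard org.w3c.dom tree and skips comments plus any element whose tag name is in a configurable set. The parse helper assumes well-formed XML input, which real HTML often is not.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper for illustration; not the parse-html plugin code.
class ExcludingTextExtractor {
    private final Set<String> excluded = new HashSet<>();

    ExcludingTextExtractor(String... tagNames) {
        for (String tag : tagNames) {
            excluded.add(tag.toLowerCase(Locale.ROOT));
        }
    }

    // Returns the text under the node, minus excluded elements and comments.
    String extract(Node node) {
        StringBuilder sb = new StringBuilder();
        collect(node, sb);
        return sb.toString();
    }

    private void collect(Node node, StringBuilder sb) {
        short type = node.getNodeType();
        if (type == Node.COMMENT_NODE) {
            return; // never index comment text
        }
        if (type == Node.ELEMENT_NODE
                && excluded.contains(node.getNodeName().toLowerCase(Locale.ROOT))) {
            return; // skip the whole subtree of an excluded tag
        }
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
            return;
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collect(child, sb);
        }
    }

    // Assumption: input is well-formed XML (for example XHTML).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><p>keep</p><div>skip</div></body></html>");
        System.out.println(new ExcludingTextExtractor("script", "style", "div").extract(doc));
    }
}
```

Reading the excluded tag names from configuration rather than hardcoding them would let each deployment choose which parts of a page to leave out of the index.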