Lucene, HTML and Hebrew

Itamar Syn-Hershko Mon, 21 Jan 2008 12:00:21 -0800

Hi all,

I'm starting in the process of creating Hebrew support for Lucene.
Specifically I'm using Clucene (which is an awesome and strong port), but
that shouldn't matter for my questions. Please, if you know of any info or
similar project let me know, it can save me loads of time and headaches. As
far as I could tell, such a thing is not in existance. Since I'm using
Clucene, the following questions reffer to Lucene < 2.0, should it make any
difference.


In my project I'm indexing HTML files which are either in English or Hebrew,
but may have words of the other language, and I need to enable searches for
both.
So in my application I load the HTML files one by one, taking text snippets
paragraph by paragraph (<p> tag). For each paragraph I strip it from HTML
tags and resolve HTML entities and then create a tokenized and indexed, not
stored, field. The HTML files are articles with main header, a few
sub-headers, quotes, normal text, and footnotes. I want to index all text as
normal, except for terms from headers which I want to boost (h1 the highest
boost, h2 less and so on). Footnotes should get negative boost (less than
normal text, which is neutral), quotes being indexed as field "quote" to
allow for quotes searches but are considered as normal on queries.
Due to the structure of the HTML files I have, calls to add new fields to a
document while parsing the HTML will be made in this order (psuedo code):

        Doc->add(new Field("h1", "Article Name / Adam Smith",
Field::STORE_NO | Field::INDEX_TOKENIZED);
        Doc->add(new Field("normal", "some introduction text",
Field::STORE_NO | Field::INDEX_TOKENIZED);
        Doc->add(new Field("h2", "sub-header 1", Field::STORE_NO |
Field::INDEX_TOKENIZED);
        Doc->add(new Field("normal", "some text paragrph1", Field::STORE_NO
| Field::INDEX_TOKENIZED);
        Doc->add(new Field("normal", "some text paragrph2", Field::STORE_NO
| Field::INDEX_TOKENIZED);
        Doc->add(new Field("normal", "some text paragrph3", Field::STORE_NO
| Field::INDEX_TOKENIZED);
        Doc->add(new Field("normal", "some text paragrph4", Field::STORE_NO
| Field::INDEX_TOKENIZED);
        Doc->add(new Field("h2", "sub-header 2", Field::STORE_NO |
Field::INDEX_TOKENIZED);
        Doc->add(new Field("normal", "some text paragrph1", Field::STORE_NO
| Field::INDEX_TOKENIZED);
        Doc->add(new Field("normal", "some text paragrph2", Field::STORE_NO
| Field::INDEX_TOKENIZED);
        Doc->add(new Field("normal", "some text paragrph3", Field::STORE_NO
| Field::INDEX_TOKENIZED);
        Doc->add(new Field("normal", "some text paragrph4", Field::STORE_NO
| Field::INDEX_TOKENIZED);
        Doc->add(new Field("h2", "sub-header 3", Field::STORE_NO |
Field::INDEX_TOKENIZED);
        Doc->add(new Field("normal", "some text paragrph1", Field::STORE_NO
| Field::INDEX_TOKENIZED);
        Doc->add(new Field("normal", "some text paragrph2", Field::STORE_NO
| Field::INDEX_TOKENIZED);
        Doc->add(new Field("normal", "some text paragrph3", Field::STORE_NO
| Field::INDEX_TOKENIZED);
        Doc->add(new Field("normal", "some text paragrph4", Field::STORE_NO
| Field::INDEX_TOKENIZED);

Based on the above info:

1) How would Lucene treat the "normal" paragraph when they are added that
way? Would proximity and frequency data be computed between paragraph1 and
paragraph2 (last word of former with first word of latter)? What about
proximity data between "h2" paragraph and the "normal" before or after it?

2) How would I set the boosts for the headers and footnotes? I'd rather have
it stored within the index file than have to append it to each and every
query I will execute, but I'm open to suggestions. I'm more interested in
performance and flexibility.

3) When executing a query against the above-mentioned index, how would I
execute a set of words as a query (boolean quey using list of inflated
words) without repeating this set of words for each and every field? Any
support for something like *:word1 OR word2 OR word3 (instead of
normal:(word1 OR word2 OR word3) AND quote:(word1 OR word2 OR word3) AND
h1:(word1 OR word2 OR word3) etc...)?

4) Writing a Hebrew analyzer, I'm considering using a StandardAnalyzer and
just ammend it so when it recognizes a Hebrew word it will call some
function that will parse it correctly (there are some differences, such as
quotes in the middle of a word are legitmate, also remove Niqqud). If it's a
non-Hebrew word it will just continue as usual with its default behavior and
functions.
Also, this means I will index ALL words (Hebrew AND English) into the same
index. The thinking behind this is to allow for searches with both Hebrew
and English words to be performed successfully, taking into account there
shouldn't be any downsides for indexing two languages within one index. I'm
aware of the way Lucene stores words (not the whole word, but only the part
that is different from the previous), but I really don't see how bad that's
gonna be...

5) Where should a stemmer be used? As far as I see it, it should only be
used for query inflation, am I right?

I know that might be quite heavy on you guys, but I will appreciate detailed
answers. Thanks in advance!

Itamar.



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Lucene, HTML and Hebrew

Reply via email to