Hi Itamar,
In another thread, you wrote:
> Yesterday I sent an email to this group querying about some
> very important (to me...) features of Lucene. I'm giving it
> another chance before it goes unnoticed or forgotten. If it
> was too long please let me know and I will email a shorter
> list of questions....
I think I have something like a 30-second rule for posts on this list: if I
can't figure out what the question is within 30 seconds, I move on. Your post
was so verbose that I gave up before I asked myself whether I could help.
(Déjà vu - upon re-reading this paragraph, it sounds very much like something
Hoss has said on this list...)
Although I answer your original post below, please don't take this as
affirmation of your "reminder" approach. In my experience, this strategy is
interpreted as badgering, and tends to affect response rate in the opposite
direction to that intended.
Short, focused questions will maximize the response rate here (and elsewhere, I
suspect). Also, it helps if there is some indication that the questioner has
attempted to answer the question for themselves using readily available
resources, but failed.
On 01/21/2008 at 2:59 PM, Itamar Syn-Hershko wrote:
> 1) How would Lucene treat the "normal" paragraph when they
> are added that way? Would proximity and frequency data be
> computed between paragraph1 and paragraph2 (last word of
> former with first word of latter)? What about proximity
> data between "h2" paragraph and the "normal" before or
> after it?
Lucene does not store proximity relations between data in different fields,
only within individual fields. Similarly, term frequencies are stored
per-field, not per-document.
> 2) How would I set the boosts for the headers and footnotes?
> I'd rather have it stored within the index file than have to
> append it to each and every query I will execute, but I'm
> open to suggestions. I'm more interested in performance and
> flexibility.
AFAIK, there is no way currently in Lucene to set index-time per-field boosts -
only per-document boosts are supported.
One very coarse-grained boosting trick you could use is to repeat the text of
headers, etc., that you want to boost, e.g.:
    Doc->add(new Field("h2", "sub-header 1 $$$ sub-header 1",
                       Field::STORE_NO |
                       Field::INDEX_TOKENIZED));
I included "$$$" as an example of how to break proximity between the first and
last terms in the "sub-header 1" text - note, however, that this particular
string may not serve this function properly, depending on the analyzer you
choose.
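If you apply this trick to many headers, a small helper keeps the repetition consistent. This is just an illustrative sketch - the function name is mine, and the separator token is subject to the same analyzer caveat as above:

```cpp
#include <string>

// Sketch: repeat a field's text `times` times, joined by a
// proximity-breaking token, to approximate an index-time field boost.
// The separator (e.g. "$$$") must survive your analyzer to be useful.
std::string repeatForBoost(const std::string& text, int times,
                           const std::string& gap) {
    std::string out;
    for (int i = 0; i < times; ++i) {
        if (i > 0) out += " " + gap + " ";
        out += text;
    }
    return out;
}
```

You would then pass the result to the Field constructor instead of writing the repetition out by hand.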
Note also that this is an issue elsewhere for you, since each addition of field
information is understood by Lucene as contiguous. That is, unless you do
something to inhibit it, proximity matches will occur between the last term
from one <h2> tag and the first term from the next <h2> tag in the same doc.
> 3) When executing a query against the above-mentioned index,
> how would I execute a set of words as a query (boolean query
> using list of inflated words) without repeating this set of
> words for each and every field? Any support for something
> like *:word1 OR word2 OR word3 (instead of normal:(word1 OR
> word2 OR word3) AND quote:(word1 OR word2 OR word3) AND
> h1:(word1 OR word2 OR word3) etc...)?
MultiFieldQueryParser may do something like what you want:
<http://lucene.apache.org/java/1_4_3/api/org/apache/lucene/queryParser/MultiFieldQueryParser.html>
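If you end up building query strings yourself instead, the expansion is mechanical. Here is a sketch (field and term names are placeholders; MultiFieldQueryParser does essentially this for you, and note that "match in any field" calls for OR between the per-field clauses, not the AND in your example):

```cpp
#include <string>
#include <vector>

// Sketch: expand one list of terms across several fields, producing
// the query string you would otherwise write out by hand, e.g.
// normal:(word1 OR word2) OR h1:(word1 OR word2)
std::string expandAcrossFields(const std::vector<std::string>& fields,
                               const std::vector<std::string>& terms) {
    std::string query;
    for (size_t f = 0; f < fields.size(); ++f) {
        if (f > 0) query += " OR ";
        query += fields[f] + ":(";
        for (size_t t = 0; t < terms.size(); ++t) {
            if (t > 0) query += " OR ";
            query += terms[t];
        }
        query += ")";
    }
    return query;
}
```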
> 4) Writing a Hebrew analyzer, I'm considering using a
> StandardAnalyzer and just amend it so when it recognizes a
> Hebrew word it will call some function that will parse it
> correctly (there are some differences, such as quotes in
> the middle of a word are legitimate, also remove Niqqud).
> If it's a non-Hebrew word it will just continue as usual
> with its default behavior and functions.
> Also, this means I will index ALL words (Hebrew AND English)
> into the same index. The thinking behind this is to allow
> for searches with both Hebrew and English words to be
> performed successfully, taking into account there shouldn't
> be any downsides for indexing two languages within one
> index. I'm aware of the way Lucene stores words (not the
> whole word, but only the part that is different from the
> previous), but I really don't see how bad that's gonna
> be...
Not sure what the question is here - if you mean to ask "What are the impacts
of including terms from two languages in a single index?" then my answer is "it
depends"...
For languages that share orthographies (e.g. Spanish and French, to a great
extent), false friends (i.e. the same written term meaning completely different
things in the two languages) could cause degraded precision. AFAIK, this is not
an issue for Hebrew and English.
The only other issue I can think of is that you will be taking symbols (words)
from two completely different meaning-systems and merging them into the same
index space. For similar contexts ("should I have separate fields for each
unit of information?"), the advice generally given on this list is to put
everything into a single field. In short: try the simplest thing first, test,
and if the performance is not good enough, then increase the complexity of your
solution, test, and iterate until it is. But you probably already knew that :).
> 5) Where should a stemmer be used? As far as I see it, it
> should only be used for query inflation, am I right?
Generally, stemming increases recall (proportion of matching relevant docs
among relevant docs in the entire corpus), and decreases precision (proportion
of relevant docs among matching docs).
The standard advice is to use the same analysis pipeline at both index-time and
query-time; in the context of your question, that would mean stemming in both
places.
However, adding stems to a query, especially if you boosted them lower than the
original terms, is probably a good strategy to maximize both precision and
recall. The cost of this approach is two-fold: a larger index than if you had
performed index-time stemming; and increased query-time processing, hence lower
query throughput.
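As a sketch of that query-side strategy (the "^0.5" is Lucene's query-time boost syntax; the stem argument stands in for whatever your Hebrew stemmer returns, and the 0.5 weight is just an example value):

```cpp
#include <string>

// Sketch: expand a query term into "(term OR stem^0.5)", so exact
// matches outrank stemmed matches. Skips the expansion when the
// stemmer returns nothing new.
std::string expandWithStem(const std::string& term, const std::string& stem) {
    if (stem.empty() || stem == term)
        return term;  // nothing useful to add
    return "(" + term + " OR " + stem + "^0.5)";
}
```

Applied across a whole query, this gives you the recall of stemming while letting unstemmed matches score higher.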
Steve