Hi Itamar,
In another thread, you wrote:
> Yesterday I sent an email to this group querying about some
> very important (to me...) features of Lucene. I'm giving it
> another chance before it goes unnoticed or forgotten. If it
> was too long please let me know and I will email a shorter
> list of questions....
I think I have something like a 30-second rule for posts on this list: if I
can't figure out what the question is within 30 seconds, I move on. Your post
was so verbose that I gave up before I asked myself whether I could help.
(Déjà vu - upon re-reading this paragraph, it sounds very much like something
Hoss has said on this list...)
Although I answer your original post below, please don't take this as
affirmation of your "reminder" approach. In my experience, this strategy is
interpreted as badgering, and tends to affect response rate in the opposite
direction to that intended.
Short, focused questions will maximize the response rate here (and elsewhere, I
suspect). Also, it helps if there is some indication that the questioner has
attempted to answer the question for themselves using readily available
resources, but failed.
On 01/21/2008 at 2:59 PM, Itamar Syn-Hershko wrote:
> 1) How would Lucene treat the "normal" paragraph when they
> are added that way? Would proximity and frequency data be
> computed between paragraph1 and paragraph2 (last word of
> former with first word of latter)? What about proximity
> data between "h2" paragraph and the "normal" before or
> after it?
Lucene does not store proximity relations between data in different fields,
only within individual fields. Similarly, term frequencies are stored
per-field, not per-document.
> 2) How would I set the boosts for the headers and footnotes?
> I'd rather have it stored within the index file than have to
> append it to each and every query I will execute, but I'm
> open to suggestions. I'm more interested in performance and
> flexibility.
AFAIK, there is no way currently in Lucene to set index-time per-field boosts -
only per-document boosts are supported.
One very coarse-grained boosting trick you could use is to repeat the text of
headers, etc., that you want to boost, e.g.:
    Doc->add(new Field("h2", "sub-header 1 $$$ sub-header 1",
                       Field::STORE_NO |
                       Field::INDEX_TOKENIZED));
I included "$$$" as an example of how to break proximity between the first and
last terms in the "sub-header 1" text - note, however, that this particular
string may not serve this function properly, depending on the analyzer you
choose.
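If you apply this trick to many headers, a small helper keeps the repetition consistent. This is just an illustrative sketch - the function name is mine, and the separator token is subject to the same analyzer caveat as above:

```cpp
#include <string>

// Sketch: repeat a field's text `times` times, joined by a
// proximity-breaking token, to approximate an index-time field boost.
// The separator (e.g. "$$$") must survive your analyzer to be useful.
std::string repeatForBoost(const std::string& text, int times,
                           const std::string& gap) {
    std::string out;
    for (int i = 0; i < times; ++i) {
        if (i > 0) out += " " + gap + " ";
        out += text;
    }
    return out;
}
```

You would then pass the result to the Field constructor instead of writing the repetition out by hand.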
Note also that this is an issue elsewhere for you, since each addition of field
information is understood by Lucene as contiguous. That is, unless you do
something to inhibit it, proximity matches will occur between the last term
from one <h2> tag and the first term from the next <h2> tag in the same doc.
> 3) When executing a query against the above-mentioned index,
> how would I execute a set of words as a query (boolean query
> using list of inflated words) without repeating this set of
> words for each and every field? Any support for something
> like *:word1 OR word2 OR word3 (instead of normal:(word1 OR
> word2 OR word3) AND quote:(word1 OR word2 OR word3) AND
> h1:(word1 OR word2 OR word3) etc...)?
MultiFieldQueryParser may do something like what you want:
<http://lucene.apache.org/java/1_4_3/api/org/apache/lucene/queryParser/MultiFieldQueryParser.html>
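If you end up building query strings yourself instead, the expansion is mechanical. Here is a sketch (field and term names are placeholders; MultiFieldQueryParser does essentially this for you, and note that "match in any field" calls for OR between the per-field clauses, not the AND in your example):

```cpp
#include <string>
#include <vector>

// Sketch: expand one list of terms across several fields, producing
// the query string you would otherwise write out by hand, e.g.
// normal:(word1 OR word2) OR h1:(word1 OR word2)
std::string expandAcrossFields(const std::vector<std::string>& fields,
                               const std::vector<std::string>& terms) {
    std::string query;
    for (size_t f = 0; f < fields.size(); ++f) {
        if (f > 0) query += " OR ";
        query += fields[f] + ":(";
        for (size_t t = 0; t < terms.size(); ++t) {
            if (t > 0) query += " OR ";
            query += terms[t];
        }
        query += ")";
    }
    return query;
}
```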
> 4) Writing a Hebrew analyzer, I'm considering using a
> StandardAnalyzer and just amend it so when it recognizes a
> Hebrew word it will call some function that will parse it
> correctly (there are some differences, such as quotes in
> the middle of a word are legitimate, also remove Niqqud).
> If it's a non-Hebrew word it will just continue as usual
> with its default behavior and functions.
> Also, this means I will index ALL words (Hebrew AND English)
> into the same index. The thinking behind this is to allow
> for searches with both Hebrew and English words to be
> performed successfully, taking into account there shouldn't
> be any downsides for indexing two languages within one
> index. I'm aware of the way Lucene stores words (not the
> whole word, but only the part that is different from the
> previous), but I really don't see how bad that's gonna
> be...
Not sure what the question is here - if you mean to ask "What are the impacts
of including terms from two languages in a single index?" then my answer is "it
depends"...
For languages that share orthographies (e.g. Spanish and French, to a great
extent), false friends (i.e. the same written term meaning completely different
things in the two languages) could cause degraded precision. AFAIK, this is not
an issue for Hebrew and English.
The only other issue I can think of is that you will be taking symbols (words)
from two completely different meaning-systems and merging them into the same
index space. For similar contexts ("should I have separate fields for each
unit of information?"), the advice generally given on this list is to put
everything into a single field. In short: try the simplest thing first, test,
and if the performance is not good enough, then increase the complexity of your
solution, test, and iterate until it is. But you probably already knew that :).
> 5) Where should a stemmer be used? As far as I see it, it
> should only be used for query inflation, am I right?
Generally, stemming increases recall (proportion of matching relevant docs
among relevant docs in the entire corpus), and decreases precision (proportion
of relevant docs among matching docs).
The standard advice is to use the same analysis pipeline at both index-time and
query-time; in the context of your question, that would mean stemming in both
places.
However, adding stems to a query, especially if you boosted them lower than the
original terms, is probably a good strategy to maximize both precision and
recall. The cost of this approach is two-fold: a larger index than if you had
performed index-time stemming; and increased query-time processing, hence lower
query throughput.
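As a sketch of that query-side strategy (the "^0.5" is Lucene's query-time boost syntax; the stem argument stands in for whatever your Hebrew stemmer returns, and the 0.5 weight is just an example value):

```cpp
#include <string>

// Sketch: expand a query term into "(term OR stem^0.5)", so exact
// matches outrank stemmed matches. Skips the expansion when the
// stemmer returns nothing new.
std::string expandWithStem(const std::string& term, const std::string& stem) {
    if (stem.empty() || stem == term)
        return term;  // nothing useful to add
    return "(" + term + " OR " + stem + "^0.5)";
}
```

Applied across a whole query, this gives you the recall of stemming while letting unstemmed matches score higher.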
Steve