Hi Itamar,

In another thread, you wrote:
> Yesterday I sent an email to this group querying about some
> very important (to me...) features of Lucene. I'm giving it
> another chance before it goes unnoticed or forgotten. If it
> was too long please let me know and I will email a shorter
> list of questions....

I think I have something like a 30-second rule for posts on this list: if I can't figure out what the question is within 30 seconds, I move on. Your post was so verbose that I gave up before I even asked myself whether I could help. (Déjà vu - on re-reading this paragraph, it sounds very much like something Hoss has said on this list...)

Although I answer your original post below, please don't take this as affirmation of your "reminder" approach. In my experience, this strategy is interpreted as badgering, and tends to affect the response rate in the opposite direction to the one intended. Short, focused questions will maximize the response rate here (and elsewhere, I suspect). It also helps if there is some indication that the questioner has tried to answer the question for themselves using readily available resources, but failed.

On 01/21/2008 at 2:59 PM, Itamar Syn-Hershko wrote:

> 1) How would Lucene treat the "normal" paragraph when they
> are added that way? Would proximity and frequency data be
> computed between paragraph1 and paragraph2 (last word of
> former with first word of latter)? What about proximity
> data between "h2" paragraph and the "normal" before or
> after it?

Lucene does not store proximity relations between data in different fields, only within individual fields. Similarly, term frequencies are stored per field, not per document.

> 2) How would I set the boosts for the headers and footnotes?
> I'd rather have it stored within the index file than have to
> append it to each and every query I will execute, but I'm
> open to suggestions. I'm more interested in performance and
> flexibility.
AFAIK, there is currently no way in Lucene to set index-time per-field boosts - only per-document boosts are supported. One very coarse-grained boosting trick you could use is to repeat the text of the headers, etc., that you want to boost, e.g.:

    Doc->add(new Field("h2", "sub-header 1 $$$ sub-header 1", Field::STORE_NO | Field::INDEX_TOKENIZED));

I included "$$$" as an example of how to break proximity between the first and last terms of the repeated "sub-header 1" text - note, however, that this particular string may not serve that function, depending on the analyzer you choose. Note also that this is an issue elsewhere for you, since Lucene treats each addition of field information as contiguous. That is, unless you do something to inhibit it, proximity matches will occur between the last term from one <h2> tag and the first term from the next <h2> tag in the same document.

> 3) When executing a query against the above-mentioned index,
> how would I execute a set of words as a query (boolean query
> using a list of inflated words) without repeating this set of
> words for each and every field? Any support for something
> like *:word1 OR word2 OR word3 (instead of normal:(word1 OR
> word2 OR word3) AND quote:(word1 OR word2 OR word3) AND
> h1:(word1 OR word2 OR word3) etc...)?

MultiFieldQueryParser may do something like what you want:

<http://lucene.apache.org/java/1_4_3/api/org/apache/lucene/queryParser/MultiFieldQueryParser.html>

> 4) Writing a Hebrew analyzer, I'm considering using a
> StandardAnalyzer and just amend it so that when it recognizes a
> Hebrew word it will call some function that will parse it
> correctly (there are some differences, such as quotes in
> the middle of a word are legitimate, also remove Niqqud).
> If it's a non-Hebrew word it will just continue as usual
> with its default behavior and functions.
> Also, this means I will index ALL words (Hebrew AND English)
> into the same index.
> The thinking behind this is to allow
> for searches with both Hebrew and English words to be
> performed successfully, taking into account there shouldn't
> be any downsides to indexing two languages within one
> index. I'm aware of the way Lucene stores words (not the
> whole word, but only the part that differs from the
> previous one), but I really don't see how bad that's going to
> be...

Not sure what the question is here - if you mean to ask "What are the impacts of including terms from two languages in a single index?", then my answer is "it depends"...

For languages that share orthographies (e.g. Spanish and French, to a great extent), false cognates (i.e. terms with the same form but completely different meanings in the two languages) could cause degraded precision. AFAIK, this is not an issue for Hebrew and English. The only other issue I can think of is that you will be taking symbols (words) from two completely different meaning-systems and merging them into the same index space.

For similar contexts ("should I have separate fields for each unit of information?"), the advice generally given on this list is to put everything into a single field. In short: try the simplest thing first, test, and if the performance is not good enough, increase the complexity of your solution, test, and iterate until it is. But you probably already knew that :).

> 5) Where should a stemmer be used? As far as I see it, it
> should only be used for query inflation, am I right?

Generally, stemming increases recall (the proportion of all relevant docs in the corpus that are matched) and decreases precision (the proportion of matched docs that are relevant). The standard advice is to use the same analysis pipeline at both index-time and query-time; in the context of your question, that would mean stemming in both places.
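To make the "same pipeline in both places" advice concrete, here is a toy C++ sketch (this is NOT Lucene or CLucene code, and the suffix-stripping stem() below is a made-up stand-in for a real stemmer, not a real English or Hebrew one): the point is simply that one analyze() function is shared by the indexing path and the query path, so both sides produce the same terms.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical suffix-stripping "stemmer" -- a stand-in for a real
// analyzer component, NOT a real English or Hebrew stemmer.
std::string stem(std::string term) {
    std::transform(term.begin(), term.end(), term.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    for (std::string suffix : {"ing", "ed", "s"}) {
        if (term.size() > suffix.size() &&
            term.compare(term.size() - suffix.size(),
                         suffix.size(), suffix) == 0) {
            return term.substr(0, term.size() - suffix.size());
        }
    }
    return term;
}

// The point: this SAME function is run over document text at
// index-time and over query text at query-time, so the two sides
// produce matching terms.
std::vector<std::string> analyze(const std::string& text) {
    std::vector<std::string> terms;
    std::istringstream in(text);
    for (std::string token; in >> token;) {
        terms.push_back(stem(token));
    }
    return terms;
}
```

For example, analyze("Searching indexed documents") at index-time and analyze("search index document") at query-time both yield the terms search, index, document, so the query matches the document; stemming on only one side would break that alignment.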
However, adding stems to the query, especially if you boost them lower than the original terms, is probably a good strategy for maximizing both precision and recall. The cost of this approach is two-fold: a larger index than if you had performed index-time stemming, and increased query-time processing, hence lower query throughput.

Steve

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]