Hi all,

I'm starting work on Hebrew support for Lucene. Specifically I'm using CLucene (which is an awesome and solid port), but that shouldn't matter for my questions. If you know of any information or a similar project, please let me know; it could save me loads of time and headaches. As far as I can tell, no such thing exists yet. Since I'm using CLucene, the following questions refer to Lucene < 2.0, should it make any difference.

In my project I'm indexing HTML files which are either in English or in Hebrew, but may contain words in the other language, and I need to support searches in both. My application loads the HTML files one by one and takes text snippets paragraph by paragraph (the <p> tag). For each paragraph I strip the HTML tags, resolve HTML entities, and then create a tokenized and indexed, but not stored, field.

The HTML files are articles with a main header, a few sub-headers, quotes, normal text, and footnotes. I want to index all text as normal, except for terms from headers, which I want to boost (h1 gets the highest boost, h2 less, and so on). Footnotes should get a lower boost than normal text (which is neutral). Quotes are indexed into a separate "quote" field so they can be searched on their own, but should be treated as normal text in regular queries.

Due to the structure of the HTML files I have, calls to add new fields to a document while parsing the HTML will be made in this order (pseudocode):

    Doc->add(new Field("h1", "Article Name / Adam Smith", Field::STORE_NO | Field::INDEX_TOKENIZED));
    Doc->add(new Field("normal", "some introduction text", Field::STORE_NO | Field::INDEX_TOKENIZED));
    Doc->add(new Field("h2", "sub-header 1", Field::STORE_NO | Field::INDEX_TOKENIZED));
    Doc->add(new Field("normal", "some text paragraph1", Field::STORE_NO | Field::INDEX_TOKENIZED));
    Doc->add(new Field("normal", "some text paragraph2", Field::STORE_NO | Field::INDEX_TOKENIZED));
    Doc->add(new Field("normal", "some text paragraph3", Field::STORE_NO | Field::INDEX_TOKENIZED));
    Doc->add(new Field("normal", "some text paragraph4", Field::STORE_NO | Field::INDEX_TOKENIZED));
    Doc->add(new Field("h2", "sub-header 2", Field::STORE_NO | Field::INDEX_TOKENIZED));
    Doc->add(new Field("normal", "some text paragraph1", Field::STORE_NO | Field::INDEX_TOKENIZED));
    Doc->add(new Field("normal", "some text paragraph2", Field::STORE_NO | Field::INDEX_TOKENIZED));
    Doc->add(new Field("normal", "some text paragraph3", Field::STORE_NO | Field::INDEX_TOKENIZED));
    Doc->add(new Field("normal", "some text paragraph4", Field::STORE_NO | Field::INDEX_TOKENIZED));
    Doc->add(new Field("h2", "sub-header 3", Field::STORE_NO | Field::INDEX_TOKENIZED));
    Doc->add(new Field("normal", "some text paragraph1", Field::STORE_NO | Field::INDEX_TOKENIZED));
    Doc->add(new Field("normal", "some text paragraph2", Field::STORE_NO | Field::INDEX_TOKENIZED));
    Doc->add(new Field("normal", "some text paragraph3", Field::STORE_NO | Field::INDEX_TOKENIZED));
    Doc->add(new Field("normal", "some text paragraph4", Field::STORE_NO | Field::INDEX_TOKENIZED));

Based on the above info:

1) How would Lucene treat the "normal" paragraphs when they are added that way? Would proximity and frequency data be computed between paragraph1 and paragraph2 (the last word of the former with the first word of the latter)? What about proximity data between an "h2" paragraph and the "normal" paragraph before or after it?

2) How would I set the boosts for the headers and footnotes? I'd rather have them stored in the index itself than have to append them to each and every query I execute, but I'm open to suggestions. I'm mostly interested in performance and flexibility.

3) When querying the above-mentioned index, how would I run a set of words as a query (a boolean query using a list of inflated words) without repeating this set of words for each and every field? Is there any support for something like *:word1 OR word2 OR word3 (instead of normal:(word1 OR word2 OR word3) AND quote:(word1 OR word2 OR word3) AND h1:(word1 OR word2 OR word3) etc...)?

4) For the Hebrew analyzer, I'm considering taking the StandardAnalyzer and just amending it so that when it recognizes a Hebrew word it calls a function that parses it correctly (there are some differences: quotes in the middle of a word are legitimate, and Niqqud should be removed). For a non-Hebrew word it will just continue as usual with its default behavior and functions. This also means I will index ALL words (Hebrew AND English) into the same index.
The thinking behind this is to allow searches containing both Hebrew and English words to succeed, on the assumption that there are no downsides to indexing two languages in one index. I'm aware of the way Lucene stores words (not the whole word, only the part that differs from the previous one), but I really don't see how that would hurt much...

5) Where should a stemmer be used? As far as I can see, it should only be used for query inflation. Am I right?

I know this might be quite a lot to ask, but I would appreciate detailed answers.

Thanks in advance!

Itamar.
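
P.S. To make question 4 a bit more concrete, here is a minimal sketch of the Hebrew-specific step I have in mind. It is plain standard C++ and does not touch the CLucene API; the function names (isHebrewMark, hebrewTokens and so on) are made up for illustration, and the choice of which characters count as a "quote" is my own assumption. It only extracts Hebrew tokens: non-Hebrew text is treated as a separator here, since it would go through the StandardAnalyzer's usual path instead.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hebrew combining marks: cantillation (U+0591..U+05AF) and Niqqud points.
bool isHebrewMark(char32_t c) {
    return (c >= 0x0591 && c <= 0x05BD) || c == 0x05BF ||
           c == 0x05C1 || c == 0x05C2 ||
           c == 0x05C4 || c == 0x05C5 || c == 0x05C7;
}

// Hebrew letters alef..tav, including the final forms.
bool isHebrewLetter(char32_t c) {
    return c >= 0x05D0 && c <= 0x05EA;
}

// Assumption: these four characters count as "quotes" inside a word
// (ASCII double and single quote, Hebrew geresh and gershayim).
bool isQuoteChar(char32_t c) {
    return c == U'"' || c == U'\'' || c == 0x05F3 || c == 0x05F4;
}

// True if the first non-mark character at or after position i is a Hebrew letter.
bool letterFollows(const std::u32string& text, std::size_t i) {
    while (i < text.size() && isHebrewMark(text[i])) ++i;
    return i < text.size() && isHebrewLetter(text[i]);
}

// Hypothetical helper: returns the Hebrew tokens of `text` with Niqqud removed.
// A quote character is kept only when it sits between two Hebrew letters.
std::vector<std::u32string> hebrewTokens(const std::u32string& text) {
    std::vector<std::u32string> tokens;
    std::u32string current;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char32_t c = text[i];
        if (isHebrewMark(c)) continue;  // drop the mark, keep the token open
        const bool internalQuote =
            isQuoteChar(c) && !current.empty() && letterFollows(text, i + 1);
        if (isHebrewLetter(c) || internalQuote) {
            current.push_back(c);
        } else if (!current.empty()) {
            tokens.push_back(current);  // any other character ends the token
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```

With this sketch an acronym such as צה"ל should stay a single token, while a word wrapped in ordinary quotation marks should come out without them. Whether a step like this belongs inside the tokenizer or in a separate filter is exactly the kind of advice I'm hoping for.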