calculate wi = tfi * IDFi for each document.
If I have search results, how can I calculate, using Lucene's API, wi = tfi * IDFi for each document?

wi = term weight
tfi = term frequency in a document
IDFi = inverse document frequency = log(D/dfi)
dfi = document frequency, i.e. the number of documents containing term i
D = number of documents in my search result

Thanks,
Andrew

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
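A minimal sketch of the arithmetic Andrew describes, independent of Lucene. In the 1.x API the inputs come from the index: dfi is `IndexReader.docFreq(Term)` and tfi is the `freq()` of a `TermDocs` enumeration for that term; the concrete numbers below are hypothetical stand-ins for those lookups.

```java
// Computes w_i = tf_i * log(D / df_i), per the definitions in the post.
// In Lucene, tf_i per document comes from IndexReader.termDocs(term).freq()
// and df_i from IndexReader.docFreq(term); D here is the number of documents
// in the search result, as Andrew defines it (not the whole index).
public class TermWeight {
    static double weight(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    public static void main(String[] args) {
        int tf = 3;   // term occurs 3 times in this document (hypothetical)
        int df = 10;  // 10 of the result documents contain the term
        int D = 100;  // 100 documents in the search result
        System.out.println(weight(tf, df, D));
    }
}
```

Note that the log base only rescales all weights by a constant, so natural log is as good as log10 for ranking purposes.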
Lucene and Documentum
Hi All, Has anyone had any experience using Lucene to search a Documentum repository? Thanks, Andrew
RE: how long should optimizing take
Thanks for the suggestion; Jian Chen's idea is very similar too. Optimizing that often is probably not necessary, and not that critical for speeding up the searches. I'll change the index process not to optimize at all, and execute the optimization independently of the indexing on a weekly basis.

Ross

-----Original Message-----
From: Dan Armbrust [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 02, 2005 11:10 AM
To: java-user@lucene.apache.org
Subject: Re: how long should optimizing take

I would run your optimize process in a separate thread, so that your web client doesn't have to wait for it to return. You may even want to set the optimize part up to run on a weekly schedule, at a low-load time. I probably wouldn't reoptimize after every 30 documents on an index that size. Optimizing takes a while on your index because it basically has to copy the entire index to a new index, so it will take however long it takes to copy 2 GB on your hardware, plus a small amount of overhead...

Dan

Angelov, Rossen wrote:

I would like to bring that issue up again, as I haven't resolved it yet and haven't found what's causing it. Any help, ideas, or shared experience are welcome!

Thanks,
Ross

-----Original Message-----
From: Angelov, Rossen
Sent: Friday, May 27, 2005 10:42 AM
To: 'java-user@lucene.apache.org'
Subject: how long should optimizing take

Hi, I'm having problems with Lucene optimization. Two of the indexes are about 2 GB each, and every day about 30 documents are added to each of them. At the end of the indexing, the IndexWriter optimize() method is executed, and it takes about 30 minutes to finish the optimization for each index.

The indexing happens through a web service: a servlet takes an HTTP request and executes methods to index the new documents and optimize the indexes. The problem is that the request takes too long to finish because of the optimization, and the web server doesn't return a response, so the browser will keep waiting forever.
Has anybody else experienced similar behavior with the optimization process?

Thanks,
Ross

This communication is intended solely for the addressee and is confidential and not for third party unauthorized distribution.
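Dan's "separate thread" suggestion can be sketched as follows: the servlet hands the optimize off to a single background worker and responds immediately. The `Runnable` passed in is a hypothetical stand-in for a call to `IndexWriter.optimize()` on the real index; nothing in this sketch is Lucene API.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: run the long optimize in a background thread so the HTTP request
// that triggered indexing can return without waiting ~30 minutes.
public class BackgroundOptimize {
    static final AtomicBoolean optimizing = new AtomicBoolean(false);
    static final ExecutorService pool = Executors.newSingleThreadExecutor();

    /** Returns immediately; false means an optimize was already in progress. */
    static boolean requestOptimize(Runnable optimizeIndex) {
        if (!optimizing.compareAndSet(false, true)) {
            return false;            // don't queue a second optimize behind the first
        }
        pool.submit(() -> {
            try {
                optimizeIndex.run(); // the long-running part (IndexWriter.optimize())
            } finally {
                optimizing.set(false);
            }
        });
        return true;                 // the servlet is not blocked
    }
}
```

A weekly schedule, as suggested, would simply replace `submit` with a `ScheduledExecutorService.scheduleAtFixedRate` call at a low-load time.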
RE: Indexing multiple languages
Thanks all for the useful comments. It seems that there are even more options:

4/ One index, with a separate Lucene document for each (item, language) combination, with one field that specifies the language
5/ One index, one Lucene document per item, with field names that include the language (e.g. title_en, title_cn)

I quite like 4, because you can search with no language constraint, or with one as Paul suggests below. However, some non-language-specific data might need to be repeated (e.g. dates), unless we had an extra Lucene document for all that. I wonder what the various pros and cons in terms of index size and performance would be in each case? I really don't have enough knowledge of Lucene to have any idea...

Robert Tansley / Digital Media Systems Programme / HP Labs
http://www.hpl.hp.com/personal/Robert_Tansley/

-----Original Message-----
From: Paul Libbrecht [mailto:[EMAIL PROTECTED]
Sent: 01 June 2005 04:10
To: java-user@lucene.apache.org
Subject: Re: Indexing multiple languages

On 1 June 05, at 01:12, Erik Hatcher wrote:

1/ one index for all languages
2/ one index for all languages, with an extra language field so searches can be constrained to a particular language
3/ separate indices for each language?

I would vote for option #2 as it gives the most flexibility - you can query with or without concern for language.

The way I've solved this is to make a different field name per language, as our documents can be multilingual. What's then done is query expansion at query time: given a term query for text, I duplicate it for each accepted language of the user, with a factor related to the preference of the language (e.g. the q factor in the Accept-Language HTTP header). Presumably I could be using solution 2/ as well if my queries become too big, making several documents for each language of the document.

I think it's very important to care about guessing the accepted languages of the user.
Typically, the default behaviour of Google is to only give you matches in your primary language, but then allow expansion into any language. On the other hand, if people are searching for proper nouns in metadata (e.g. DSpace), it may be advantageous to search all languages at once. This one may need particular treatment.

Tell us your success!

paul
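Paul's query-expansion idea, combined with the per-language field names of option 5, can be sketched as building one boosted clause per accepted language. The `field_lang:term^boost` syntax is Lucene's standard query-string form; the field names (`title_en`, `title_cn`) come from the thread, and the q factors are hypothetical values from an Accept-Language header.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: expand a single term query across per-language fields, boosting
// each clause by the user's Accept-Language q factor for that language.
public class LanguageExpansion {
    static String expand(String field, String term, Map<String, Double> langQ) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Double> e : langQ.entrySet()) {
            if (sb.length() > 0) sb.append(" OR ");
            sb.append(field).append('_').append(e.getKey())
              .append(':').append(term).append('^').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> prefs = new LinkedHashMap<>();
        prefs.put("en", 1.0);   // e.g. from "Accept-Language: en, cn;q=0.7"
        prefs.put("cn", 0.7);
        System.out.println(expand("title", "newspaper", prefs));
        // title_en:newspaper^1.0 OR title_cn:newspaper^0.7
    }
}
```

In real code one would build a `BooleanQuery` of boosted `TermQuery` objects rather than a string, but the expansion logic is the same.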
RE: how long should optimizing take
I'll make sure no indexing is started before the optimization is done. Most likely Sunday will be the optimization day for the indexes, and every other night the documents will be added to the index. Only searching will be available through the web service while optimizing, but this should not be a problem, as an IndexReader will be opened, not a second IndexWriter.

Ross

-----Original Message-----
From: Dan Armbrust [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 02, 2005 3:10 PM
To: java-user@lucene.apache.org
Subject: Re: how long should optimizing take

You should be careful, however, not to end up with two VM instances each trying to open an IndexWriter at the same time - one of them is going to fail. That is, if someone using your web interface tries to add a new document to the index while you have the optimizer running standalone, the web interface is not going to be able to get a lock on the index to add the documents.

Dan

Angelov, Rossen wrote:

Thanks for the suggestion; Jian Chen's idea is very similar too. Probably optimizing that often is not necessary and not that critical for speeding up the searches. I'll try changing the index process not to optimize at all and execute the optimization independently of the indexing on a weekly basis.

Ross
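The lock behaviour Dan describes can be illustrated conceptually: Lucene guards the index directory with a lock file (named `write.lock` in this era), and a second writer fails when it cannot acquire it. The sketch below mimics that protocol with a plain lock file; `acquire`/`release` are illustrative names, not Lucene API, and real Lucene handles this internally when you open an `IndexWriter`.

```java
import java.io.File;
import java.io.IOException;

// Conceptual sketch of a file-based write lock: the second process sees the
// lock file already exists and backs off, which is roughly why a second
// IndexWriter on the same index fails to obtain the lock.
public class WriteLockDemo {
    static boolean acquire(File lock) {
        try {
            return lock.createNewFile(); // atomic: false if the file already exists
        } catch (IOException e) {
            return false;
        }
    }

    static void release(File lock) {
        lock.delete();
    }
}
```

The practical takeaway in the thread stands: schedule things so only one process ever opens an IndexWriter at a time, while IndexReaders for searching remain safe to open concurrently.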
Re: Indexing and Hit Highlighting OCR Data
This is a pretty interesting problem. I envy you.

I would avoid the existing highlighter for your purposes - highlighting in token space is a very different problem from highlighting in 2D space.

Based on the XML sample you provided, it looks like your XML files are already a tokenized form of the original OCR data - by which I mean the page has already been tokenized into words whose positions are recorded. I would parse these XML docs to generate two things:

1) a stream of words for analysis/filtering (ie: stop words, stemming, synonyms)
2) a data structure mapping words to lists of positions (ie: if the same word appears in multiple places, list the word once, followed by each set of coordinates)

Use #1 in the usual way, and add a serialized form of #2 to your index as a stored keyword field - at query time, the words from your initial query can be looked up in that data structure to find the regions to highlight.

: I am involved in a project which is trying to provide searching and hit highlighting on the scanned images of historical newspapers. We have an XML-based OCR format; a sample is below. We need to index the CONTENT attribute of the String element, which is the easy part.
: We would like to be able to find the hits within this XML document in order to use the positioning information to draw the highlight boxes on the image. It doesn't make a lot of sense to just extract the CONTENT and index that, because we lose the positioning information.
: My second thought was to make a custom analyzer which dropped everything except for the content element, and then use the highlighting class in the sandbox to reanalyze the XML document and mark the hits. With the marked hits in the XML, we could find the position information and draw on the image.
: Has anyone else worked with OCR information and Lucene? What was your approach? Does this approach seem sound? Any recommendations?
:
: Thanks, Corey
:
: <TextLine HEIGHT="2307.0" WIDTH="2284.0" HPOS="1316.0" VPOS="123644.0">
:   <String STYLEREFS="ID4" HEIGHT="1922.0" WIDTH="244.0" HPOS="1316.0" VPOS="123644.0" CONTENT="The" WC="1.0"/>
:   <SP WIDTH="-244.0" HPOS="1560.0" VPOS="123644.0"/>
:   <String STYLEREFS="ID4" HEIGHT="1914.0" WIDTH="424.0" HPOS="1664.0" VPOS="123711.0" CONTENT="female" WC="1.0"/>
:   <SP WIDTH="184.0" HPOS="1480.0" VPOS="123644.0"/>
:   <String STYLEREFS="ID4" HEIGHT="2174.0" WIDTH="240.0" HPOS="2192.0" VPOS="123711.0" CONTENT="lays" WC="1.0"/>
:   <SP WIDTH="104.0" HPOS="2088.0" VPOS="123711.0"/>
:   <String STYLEREFS="ID4" HEIGHT="1981.0" WIDTH="360.0" HPOS="2528.0" VPOS="123711.0" CONTENT="about" WC="1.0"/>
:   <SP WIDTH="236.0" HPOS="2292.0" VPOS="123711.0"/>
:   <String STYLEREFS="ID4" HEIGHT="1855.0" WIDTH="216.0" HPOS="3000.0" VPOS="123770.0" CONTENT="140" WC="1.0"/>
:   <SP WIDTH="112.0" HPOS="2888.0" VPOS="123711.0"/>
:   <String STYLEREFS="ID4" HEIGHT="1729.0" WIDTH="284.0" HPOS="3316.0" VPOS="124223.0" CONTENT="eggs" WC="1.0"/>
:   <SP WIDTH="100.0" HPOS="3216.0" VPOS="123770.0"/>
: </TextLine>

-Hoss
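Hoss's structure #2 can be sketched as a map from each (lowercased) word to the list of boxes where it appears on the page, so query terms can be looked up to get the regions to highlight. The `Box` fields mirror the attribute names in the sample (HPOS, VPOS, WIDTH, HEIGHT); the class and method names are illustrative, and the serialization into a stored index field is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: word -> list of page positions, built while parsing the OCR XML,
// queried at highlight time with the terms from the user's query.
public class OcrPositions {
    record Box(double hpos, double vpos, double width, double height) {}

    final Map<String, List<Box>> positions = new HashMap<>();

    // Called once per <String CONTENT=.../> element while parsing the XML.
    void add(String word, Box box) {
        positions.computeIfAbsent(word.toLowerCase(), k -> new ArrayList<>()).add(box);
    }

    // Called at query time; an empty list means nothing to highlight.
    List<Box> lookup(String queryTerm) {
        return positions.getOrDefault(queryTerm.toLowerCase(), List.of());
    }
}
```

Note that if the indexing analyzer stems or expands words (step #1), the same transformation must be applied to the keys here, or the query-time lookup will miss variants.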
RE: Indexing multiple languages
Hi Erik,

I am a newcomer to this list, so please allow me to ask a dumb question. For the StandardAnalyzer, will it have to be modified to accept different character encodings? We have customers in China, Taiwan and Hong Kong, and Chinese data may come in 3 different encodings: Big5, GB and UTF-8. What is the default encoding for the StandardAnalyzer?

By the way, I tried running the Lucene demo (web template) to index the HTML files after adding one that contained both English and Chinese characters. I was not able to search for any Chinese in that HTML file (it returned no hits). I wonder whether I need to change some of the Java programs to index Chinese and/or accept Chinese as a search term. I was able to find the HTML file when I searched for an English word that appeared in it.

Thanks,
Bob

On May 31, 2005, Erik wrote:

Jian - have you tried Lucene's StandardAnalyzer with Chinese? It will keep English as-is (removing stop words, lowercasing, and such) and also separate CJK characters into individual tokens.

Erik

On May 31, 2005, at 5:49 PM, jian chen wrote:

Hi, Interesting topic. I thought about this as well. I want to index Chinese text mixed with English, i.e., I want to treat the English text inside Chinese text as English tokens rather than Chinese text tokens. Right now I think I may have to write a special analyzer that takes the text input and detects whether each character is an ASCII char: if it is, assemble the characters together into one token; if not, make it a Chinese word token. So, bottom line is: just one analyzer for all the text, with the if/else logic inside the analyzer. I would like to hear more thoughts about this!

Thanks,
Jian

On 5/31/05, Tansley, Robert [EMAIL PROTECTED] wrote:

Hi all, DSpace (www.dspace.org) currently uses Lucene to index metadata (Dublin Core standard) and the extracted full-text content of documents stored in it. Now that the system is being used globally, it needs to support multi-language indexing.
I've looked through the mailing list archives etc., and it seems it's easy to plug in analyzers for different languages. But what if we're trying to index multiple languages on the same site? Is it best to have:

1/ one index for all languages
2/ one index for all languages, with an extra language field so searches can be constrained to a particular language
3/ separate indices for each language?

I don't fully understand the consequences of 1/ in terms of performance, but I can see that false hits could turn up where one word appears in different languages (stemming could increase the chances of this). Also, some languages' analyzers are quite dramatically different (e.g. the Chinese one, which just treats every character as a separate token/word). On the other hand, if people are searching for proper nouns in metadata (e.g. DSpace), it may be advantageous to search all languages at once. I'm also not sure of the storage and performance consequences of 2/. Approach 3/ seems like it might be the most complex from an implementation/code point of view.

Does anyone have any thoughts or recommendations on this?

Many thanks,
Robert Tansley / Digital Media Systems Programme / HP Labs
http://www.hpl.hp.com/personal/Robert_Tansley/

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
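Jian's if/else analyzer idea from earlier in the thread can be sketched as a plain tokenizer: runs of ASCII letters and digits become single lowercased English tokens, while every other non-whitespace (e.g. Chinese) character becomes its own token. This is also roughly the behaviour Erik attributes to StandardAnalyzer for CJK input; a real Lucene analyzer would wrap this logic in a `TokenStream`, which is omitted here.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a mixed Chinese/English tokenizer: accumulate ASCII runs into
// English tokens, emit one token per non-ASCII (e.g. CJK) character.
public class MixedTokenizer {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder ascii = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128 && Character.isLetterOrDigit(c)) {
                ascii.append(Character.toLowerCase(c)); // build up an English token
            } else {
                if (ascii.length() > 0) {               // flush the pending English run
                    tokens.add(ascii.toString());
                    ascii.setLength(0);
                }
                if (c >= 128 && !Character.isWhitespace(c)) {
                    tokens.add(String.valueOf(c));      // one token per CJK character
                }
            }
        }
        if (ascii.length() > 0) tokens.add(ascii.toString());
        return tokens;
    }
}
```

Encoding-wise, note this operates on Java `String` (already decoded characters); the Big5/GB/UTF-8 question in the thread is about decoding the bytes correctly before the analyzer ever sees them.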