Re: n-gram indexing
On Monday 18 Jul 2005 21:27, Rajesh Munavalli wrote:
> At what point do I add n-grams? Does the order in which I add n-grams
> affect exact phrase queries later? My questions are: (1) Should I add
> all the 1-grams, followed by 2-grams, followed by 3-grams, etc.,
> sentence by sentence, OR (2) add all the 1-grams of the entire document
> before starting 2-grams for the entire document? What is the generally
> accepted way of adding n-grams of a document?
>
> thanks,
> Rajesh

I can't see any real advantage in storing n-grams explicitly. Just index the document and use phrase queries. Order is significant with phrase queries, if I recall correctly, although you can use SpanNearQueries to look for unordered n-grams (not that I can see why you would want to!). Perhaps if you explain a little more about what you are trying to achieve more generally, we can confirm that you don't need to mess with explicit indexing of n-grams.

Andy
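[Editor's note: for concreteness, a minimal sketch of the two query types mentioned above, assuming a Lucene 1.4-era API; the field name "contents" and the example terms are illustrative.]

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class PhraseVsSpan {

        // Exact, ordered phrase: "united states"
        public static PhraseQuery phrase() {
            PhraseQuery pq = new PhraseQuery();
            pq.add(new Term("contents", "united"));
            pq.add(new Term("contents", "states"));
            return pq;
        }

        // Unordered proximity: both terms within 3 positions, in any order
        public static SpanNearQuery near() {
            SpanQuery[] clauses = {
                new SpanTermQuery(new Term("contents", "united")),
                new SpanTermQuery(new Term("contents", "states"))
            };
            return new SpanNearQuery(clauses, 3, false); // false = unordered
        }
    }

Passing true as the third SpanNearQuery argument requires the terms to appear in order, which makes it behave more like a sloppy phrase query.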
Re: n-gram indexing
On Monday 18 Jul 2005 22:06, Rajesh Munavalli wrote:
> The intuition behind adding n-grams is to boost naturally occurring
> larger phrases, rather than using phrase queries. For example, if I am
> searching for "united states of america", I want the search results to
> return the documents ordered as follows:
>
> Rank 1 - documents containing all the words occurring together
> Rank 2 - documents containing the maximum number of words in the same
> sentence
> Rank 3 - documents containing all the words, but only some appear in
> the same sentence
> Rank 4 - documents containing at least one or two of the words
>
> If we have an n-gram index, a document talking about "united states"
> most probably gets preference over a document containing "united" and
> "states" separately. If I am correct, this can be achieved without
> using phrase queries. I am not sure if there is a better way to achieve
> the same effect.

I don't think n-grams will help either. You could perform a set of individual queries instead. First, run the phrase query to find hits with the exact phrase; then run a SpanNear query to find the docs with the terms close to each other; third, do a boolean AND query for all terms; and fourth, run a boolean OR query. It will require a little extra processing, of course, as you are technically executing four queries in one. Naturally, this only has to be done when there is more than one term in the search query. Also, there is obviously going to be some duplication of hits, so you could use a HashMap when iterating over the Hits to ensure you get unique hits when the queries are collated.

Andy
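[Editor's note: a minimal sketch of that four-tier cascade, assuming the Lucene 1.4-era BooleanQuery.add(Query, boolean, boolean) API. The class name and slop value are illustrative, and a HashSet of internal document numbers plays the role of the HashMap mentioned above, since only membership needs tracking.]

    import java.io.IOException;
    import java.util.*;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;
    import org.apache.lucene.search.spans.*;

    public class TieredSearch {

        // Returns internal doc numbers, best tier first, no duplicates.
        public static List search(Searcher searcher, String field, String[] words)
                throws IOException {
            PhraseQuery phrase = new PhraseQuery();
            SpanQuery[] spans = new SpanQuery[words.length];
            BooleanQuery and = new BooleanQuery();
            BooleanQuery or = new BooleanQuery();
            for (int i = 0; i < words.length; i++) {
                Term t = new Term(field, words[i]);
                phrase.add(t);
                spans[i] = new SpanTermQuery(t);
                and.add(new TermQuery(t), true, false);  // required clause
                or.add(new TermQuery(t), false, false);  // optional clause
            }
            // Terms within 10 positions of each other, any order
            // (the slop of 10 is an arbitrary choice)
            SpanNearQuery near = new SpanNearQuery(spans, 10, false);

            Query[] tiers = { phrase, near, and, or };
            List ordered = new ArrayList();
            Set seen = new HashSet();
            for (int q = 0; q < tiers.length; q++) {
                Hits hits = searcher.search(tiers[q]);
                for (int i = 0; i < hits.length(); i++) {
                    Integer id = new Integer(hits.id(i));
                    if (!seen.contains(id)) {   // keep first (best) tier only
                        seen.add(id);
                        ordered.add(id);
                    }
                }
            }
            return ordered;
        }
    }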
Re: Hyphenated word
On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
> I see, the list of exceptions makes this a lot more complicated than I
> thought... Thanks a lot, Erik!

I expect you'll need to do some pre-processing. Read your text into a buffer, line by line. If a given line ends with a hyphen, you can manipulate the buffer to merge the hyphenated tokens.

Andy
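[Editor's note: a minimal sketch of that buffer manipulation, assuming the text arrives via a BufferedReader; as the follow-up below points out, unconditionally dropping line-end hyphens is too naive for compounds like "read-only".]

    import java.io.BufferedReader;
    import java.io.IOException;

    public class HyphenMerger {

        public static String merge(BufferedReader in) throws IOException {
            StringBuffer buffer = new StringBuffer();
            String line;
            while ((line = in.readLine()) != null) {
                if (line.endsWith("-")) {
                    // Drop the hyphen so the next line's first token
                    // joins directly onto this one ("work-" + "ing")
                    buffer.append(line.substring(0, line.length() - 1));
                } else {
                    buffer.append(line).append(" ");
                }
            }
            return buffer.toString();
        }
    }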
Re: Hyphenated word
On Monday 13 Jun 2005 14:52, Markus Wiederkehr wrote:
> On 6/13/05, Andy Roberts [EMAIL PROTECTED] wrote:
> > On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
> > > I see, the list of exceptions makes this a lot more complicated
> > > than I thought... Thanks a lot, Erik!
> >
> > I expect you'll need to do some pre-processing. Read your text into a
> > buffer, line by line. If a given line ends with a hyphen, you can
> > manipulate the buffer to merge the hyphenated tokens.
>
> As Erik wrote, it is not that simple, unfortunately. For example, if
> one line ends with "read-" and the next line begins with "only", the
> correct word is "read-only", not "readonly". Whereas "work-" and "ing"
> should of course be merged into "working".
>
> Markus

Perhaps you could do some crude checking against a dictionary. Combine the word anyway and check if it's in the dictionary. If so, keep it merged; otherwise, it's a compound, so revert to the hyphenated form. Word lists come as part of all good OSS dictionary projects, as well as other language resources, like the BNC word lists etc.

Andy
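[Editor's note: a minimal sketch of the dictionary heuristic, assuming a plain word list with one entry per line; the file format and class name are illustrative.]

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    public class Dehyphenator {

        private final Set dictionary = new HashSet();

        public Dehyphenator(String wordListPath) throws IOException {
            BufferedReader in = new BufferedReader(new FileReader(wordListPath));
            String word;
            while ((word = in.readLine()) != null) {
                dictionary.add(word.trim().toLowerCase());
            }
            in.close();
        }

        // Given the fragment before the line break ("work-") and the
        // first token of the next line ("ing"), decide how to join them.
        public String join(String head, String tail) {
            String merged = head.substring(0, head.length() - 1) + tail;
            if (dictionary.contains(merged.toLowerCase())) {
                return merged;          // "working"
            }
            return head + tail;         // keep the hyphen: "read-only"
        }
    }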
Re: Indexing multiple languages
On Friday 03 Jun 2005 01:06, Bob Cheung wrote:
> For the StandardAnalyzer, will it have to be modified to accept
> different character encodings? We have customers in China, Taiwan and
> Hong Kong. Chinese data may come in 3 different encodings: Big5, GB and
> UTF8. What is the default encoding for the StandardAnalyzer?

The analysers themselves do not worry about encodings, per se. Java uses Unicode strings throughout, which is adequate for describing all languages. When reading in text files, it's a matter of letting the reader know which encoding the file is in; this lets Java read the text and essentially map that encoding onto Unicode. All the string operations, like analysing, are then done on these Unicode strings.

So, the task is making sure the file reader you use to open a document for indexing is given the information required to decode your file correctly. If you don't specify an encoding, Java will use a default based on the locale of your OS. For me, that's Latin1, as I'm in Britain. This is clearly inadequate for non-Latin texts: the Latin1 encoding doesn't support Chinese characters, so it wouldn't be able to read Chinese texts properly. You need to specify Big5 yourself. Read the info on InputStreamReaders:

http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStreamReader.html

Andy

> Btw, I did try running the Lucene demo (web template) to index the HTML
> files after I added one including English and Chinese characters. I was
> not able to search for any Chinese in that HTML file (returned no
> hits). I wonder whether I need to change some of the Java programs to
> index Chinese and/or accept Chinese as a search term. I was able to
> search for the HTML file if I used an English word that appeared in the
> added HTML file.
>
> Thanks,
> Bob
>
> On May 31, 2005, Erik wrote:
> > Jian - have you tried Lucene's StandardAnalyzer with Chinese? It will
> > keep English as-is (removing stop words, lowercasing, and such) and
> > separate CJK characters into individual tokens.
> >
> > Erik
> >
> > On May 31, 2005, at 5:49 PM, jian chen wrote:
> > > Hi,
> > >
> > > Interesting topic. I have thought about this as well. I want to
> > > index Chinese text mixed with English, i.e., I want to treat the
> > > English text inside Chinese text as English tokens rather than
> > > Chinese tokens. Right now I think I may have to write a special
> > > analyzer that takes the text input and detects whether each
> > > character is an ASCII char: if it is, assemble a run of them into
> > > one token; if not, make it a Chinese word token. So, bottom line:
> > > just one analyzer for all the text, with the if/else logic inside
> > > the analyzer. I would like to hear more thoughts about this!
> > >
> > > Thanks,
> > > Jian
> > >
> > > On 5/31/05, Tansley, Robert [EMAIL PROTECTED] wrote:
> > > > Hi all,
> > > >
> > > > DSpace (www.dspace.org) currently uses Lucene to index metadata
> > > > (Dublin Core standard) and the extracted full-text content of
> > > > documents stored in it. Now the system is being used globally, it
> > > > needs to support multi-language indexing. I've looked through the
> > > > mailing list archives etc. and it seems it's easy to plug in
> > > > analyzers for different languages. But what if we're trying to
> > > > index multiple languages in the same site? Is it best to have:
> > > >
> > > > 1/ one index for all languages
> > > > 2/ one index for all languages, with an extra language field so
> > > > searches can be constrained to a particular language
> > > > 3/ separate indices for each language?
> > > >
> > > > I don't fully understand the consequences of 1/ in terms of
> > > > performance, but I can see that false hits could turn up where
> > > > one word appears in different languages (stemming could increase
> > > > the chances of this). Also, some languages' analyzers are quite
> > > > dramatically different (e.g. the Chinese one, which treats every
> > > > character as a separate token/word). On the other hand, if people
> > > > are searching for proper nouns in metadata (e.g. in DSpace), it
> > > > may be advantageous to search all languages at once. I'm also not
> > > > sure of the storage and performance consequences of 2/. Approach
> > > > 3/ seems like it might be the most complex from an
> > > > implementation/code point of view. Does anyone have any thoughts
> > > > or recommendations on this?
> > > >
> > > > Many thanks,
> > > > Robert Tansley / Digital Media Systems Programme / HP Labs
> > > > http://www.hpl.hp.com/personal/Robert_Tansley/
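[Editor's note: returning to Andy's point about readers and encodings, a minimal sketch of feeding a Big5-encoded file into a Lucene document, assuming a Lucene 1.4-era API; the path and field name are illustrative.]

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class Big5Indexing {

        public static Document big5Document(String path) throws IOException {
            // Tell Java the on-disk encoding; the Reader then yields
            // Unicode characters regardless of the source encoding
            Reader reader = new InputStreamReader(new FileInputStream(path), "Big5");
            Document doc = new Document();
            // Field.Text(String, Reader) lets the analyzer tokenize the stream
            doc.add(Field.Text("contents", reader));
            return doc;
        }
    }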
Re: Best way to purposely corrupt an index?
On Wednesday 20 Apr 2005 12:52, Kevin L. Cobb wrote:
> My policy on this type of exception handling is to only bite off what
> you can chew. If you catch an IOException, then you simply report to
> the user that an unexpected error has occurred and the search engine is
> unavailable at the moment. Errors should be logged, and developers
> should look at the specifics of the error to solve the issue. As you
> implied, either it's a corrupted index, a permission problem, or
> another access problem.

Of course, you are making the assumption that Lucene is only used in the context of online search engines. This is not the case here. I have developed a stand-alone application for text analysis, and I bundle the Lucene jar with it to store text in an efficient index. Once the software is on the users' computers, I don't want to be doing any maintenance of their indexes! (And I'm sure they'd prefer it that way too.)

Andy
Best way to purposely corrupt an index?
Hi,

Seems like an odd request, I'm sure. However, my application relies on an index, and should the index become unusable for some unfortunate reason, I'd like my app to cope with the situation gracefully.

Firstly, I need to know how to detect a broken index. Opening an IndexReader can potentially throw an IOException if a problem occurs, but presumably this will be thrown for other reasons too, not just an unreadable index. Would IndexReader.indexExists() be better?

Secondly, to test how my code responds to broken indexes, I'd like to purposely break an index. Any suggestions, or will removing any file from the directory be sufficient?

Many thanks,
Andy
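[Editor's note: a minimal sketch of the defensive open being described, assuming a Lucene 1.4-era API; the class and method names are illustrative.]

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;

    public class IndexHealthCheck {

        // Returns an open reader, or null if the index is missing or broken.
        public static IndexReader openSafely(String indexDir) {
            if (!IndexReader.indexExists(indexDir)) {
                System.err.println("No index found at " + indexDir);
                return null;
            }
            try {
                return IndexReader.open(indexDir);
            } catch (IOException e) {
                // Could be a corrupt index, a permissions problem, or
                // another access failure; log it and let the caller
                // degrade gracefully rather than crash.
                System.err.println("Index at " + indexDir + " is unreadable: " + e);
                return null;
            }
        }
    }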
Re: getting the number of occurrences within a document
On Thursday 14 Apr 2005 15:15, Pablo Gomes Ludermir wrote:
> Hello all,
>
> I would like to get the following information from the index:
>
> 1. Given a term, how many times the term occurs in each document.
> Something like a triple: (Term, Doc1, Freq), (Term, Doc2, Freq),
> (Term2, DocX, Freq), ...
>
> Is it possible to do that?
>
> Regards,
> Pablo

Off the top of my head... assuming you have an IndexReader (or MultiReader) called reader:

    TermEnum te = reader.terms();
    while (te.next()) {
        Term currentTerm = te.term();
        TermDocs docs = reader.termDocs(currentTerm);
        while (docs.next()) {
            // docs.doc() is Lucene's internal document number
            System.out.println(currentTerm.text() + ", doc " + docs.doc()
                    + ", " + docs.freq());
        }
    }

HTH,
Andy
Re: Terms Position from Hits ...
I've managed something like this from a slightly different perspective:

    IndexReader ir = IndexReader.open(yourIndex);
    String searchTerm = "word";
    TermPositions tp = ir.termPositions(new Term("contents", searchTerm));
    tp.next(); // advance to the first document containing the term
    int termFreq = tp.freq();
    System.out.print(searchTerm);
    for (int i = 0; i < termFreq; i++) {
        System.out.print(" " + tp.nextPosition());
    }
    System.out.println();
    ir.close();

This will print out something like:

    word 1 67 104 155

where the term "word" occurs at positions 1, 67, 104 and 155 in the field "contents" of the index ir.

HTH,
Andy Roberts

On Sunday 10 Apr 2005 15:52, Patricio Galeas wrote:
> Hello,
> I am new to Lucene and have the following problem. When I execute a
> search, I receive the list of document Hits, and I can get the content
> of the documents without problem:
>
> for (int i = 0; i < hits.length(); i++) {
>     Document doc = hits.doc(i);
>     System.out.println(doc.get("content"));
> }
>
> Now I would like to obtain the list of all terms (and their
> corresponding positions) from each document (hits.doc(i)). I have
> experimented with creating a second index from the found documents
> (Hits) and analyzing it to obtain this information, but that works very
> slowly. Do you have another idea?
>
> Thank you for your help!
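[Editor's note: since Patricio wants positions for every matching document rather than just the first, a minimal sketch that walks the whole postings list for a term; the class and method names are illustrative.]

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermPositions;

    public class TermPositionsDump {

        public static void dump(String indexDir, String field, String text)
                throws IOException {
            IndexReader ir = IndexReader.open(indexDir);
            TermPositions tp = ir.termPositions(new Term(field, text));
            while (tp.next()) {   // one iteration per matching document
                StringBuffer line = new StringBuffer(text + " [doc " + tp.doc() + "]");
                for (int i = 0; i < tp.freq(); i++) {
                    line.append(" ").append(tp.nextPosition());
                }
                System.out.println(line);
            }
            tp.close();
            ir.close();
        }
    }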