Re: n-gram indexing

2005-07-18 Thread Andy Roberts
On Monday 18 Jul 2005 21:27, Rajesh Munavalli wrote:
 At what point do I add n-grams? Does the order in which I add n-grams
 affect exact phrase queries later? My questions are:

 (1) Should I add all the 1-grams, followed by 2-grams, followed by
 3-grams, etc., sentence by sentence, OR

 (2) Add all the 1-grams of the entire document first before starting 2-grams
 for the entire document?

 What is the generally accepted way of adding n-grams to a document?

 thanks,

 Rajesh

I can't see any real advantage in storing n-grams explicitly. Just index the
document and use phrase queries. Order is significant with phrase queries, if
I recall correctly, although you can use SpanNearQuery to look for unordered
n-grams, though I don't know why you would want to!
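
For concreteness, a minimal sketch of the two query types I mean (the field
name "contents" and the example terms are just illustrations):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Ordered: matches "united states" as an exact phrase in the "contents" field
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("contents", "united"));
phrase.add(new Term("contents", "states"));

// Unordered: matches both terms within 3 positions of each other, in any order
SpanQuery[] clauses = new SpanQuery[] {
    new SpanTermQuery(new Term("contents", "united")),
    new SpanTermQuery(new Term("contents", "states"))
};
SpanNearQuery near = new SpanNearQuery(clauses, 3, false);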

Perhaps if you explain a little more about what you are trying to achieve more
generally, we can confirm that you don't need to mess with explicit indexing
of n-grams.

Andy




Re: n-gram indexing

2005-07-18 Thread Andy Roberts
On Monday 18 Jul 2005 22:06, Rajesh Munavalli wrote:
 Intuition behind adding n-grams is to boost naturally occurring larger
 phrases versus using phrase queries. For example, if I am searching for
 "united states of america", I want the search results to return the
 documents ordered as follows:

 Rank 1 - Documents containing all the words occurring together
 Rank 2 - Documents containing the maximum number of words in the same
 sentence
 Rank 3 - Documents containing all the words, but where some might appear in
 the same sentence and some may not
 Rank 4 - Documents containing at least one or two words

 If we have an n-gram index, most probably a document talking about "united
 states" gets preference over a document containing "united" and "states"
 separately. If I am correct, this can be achieved without using phrase
 queries. I am not sure if there is a better way to achieve the same
 effect.


I don't think n-grams will help either. You could perform a set of individual
queries. Firstly, run the phrase query to find hits with the exact phrase;
then perhaps run a SpanNear query to find the docs with the terms close to
each other; thirdly, do a boolean AND query for all terms; and fourthly run an
OR boolean query. It will require a little extra processing of course, as you
are technically executing 4 queries in 1. Naturally, this only has to be done
when there is more than one term in the search query. Also, there is
obviously going to be some duplication of hits, so you could use a HashMap (or
a HashSet of document ids) when iterating over the Hits to ensure you get
unique hits when the queries are collated, as in the sketch below.
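
A rough sketch of that cascade (three of the four tiers; the SpanNear tier
would slot in between the phrase and AND queries in the same way). The index
path, the field name "contents" and the example terms are just illustrations,
and the BooleanQuery.add(query, required, prohibited) calls assume the
Lucene 1.4-style API:

import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

IndexSearcher searcher = new IndexSearcher("/path/to/index");
String[] terms = { "united", "states", "of", "america" };

PhraseQuery phrase = new PhraseQuery();
BooleanQuery allTerms = new BooleanQuery();
BooleanQuery anyTerm = new BooleanQuery();
for (int i = 0; i < terms.length; i++) {
    Term t = new Term("contents", terms[i]);
    phrase.add(t);
    allTerms.add(new TermQuery(t), true, false);   // required (AND)
    anyTerm.add(new TermQuery(t), false, false);   // optional (OR)
}

// Run the strictest query first; a document keeps the tier of the first
// (i.e. strictest) query that finds it, and later duplicates are skipped.
Query[] cascade = { phrase, allTerms, anyTerm };
Set seen = new HashSet();
for (int q = 0; q < cascade.length; q++) {
    Hits hits = searcher.search(cascade[q]);
    for (int i = 0; i < hits.length(); i++) {
        Integer id = new Integer(hits.id(i));
        if (seen.add(id)) {
            System.out.println("tier " + (q + 1) + ": doc " + hits.id(i));
        }
    }
}
searcher.close();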

Andy




Re: Hyphenated word

2005-06-13 Thread Andy Roberts
On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
 I see, the list of exceptions makes this a lot more complicated than I
 thought... Thanks a lot, Erik!


I expect you'll need to do some pre-processing. Read your text into a
buffer, line by line. If a given line ends with a hyphen, you can manipulate
the buffer to merge the hyphenated tokens, along the lines of the sketch below.
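
A rough sketch of that buffering, in the naive case where the hyphen is always
dropped when the two halves are glued together (the method name and details
are my own):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Reads a plain-text file line by line; when a line ends with a hyphen, its
// last token is carried over and prepended to the next line.
public static String dehyphenate(String path) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(path));
    StringBuffer out = new StringBuffer();
    String carry = "";
    String line;
    while ((line = in.readLine()) != null) {
        line = carry + line.trim();
        carry = "";
        if (line.endsWith("-")) {
            int lastSpace = line.lastIndexOf(' ');
            // keep the fragment (minus its hyphen) to merge with the next line
            carry = line.substring(lastSpace + 1, line.length() - 1);
            line = (lastSpace >= 0) ? line.substring(0, lastSpace) : "";
        }
        out.append(line).append(' ');
    }
    in.close();
    return out.toString();
}

(Blindly dropping the hyphen is of course wrong for genuine compounds like
"read-only"; see the dictionary check discussed later in this thread.)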

Andy





Re: Hyphenated word

2005-06-13 Thread Andy Roberts
On Monday 13 Jun 2005 14:52, Markus Wiederkehr wrote:
 On 6/13/05, Andy Roberts [EMAIL PROTECTED] wrote:
  On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
   I see, the list of exceptions makes this a lot more complicated than I
   thought... Thanks a lot, Erik!
 
  I expect you'll need to do some pre-processing. Read in your text into a
  buffer, line-by-line. If a given line ends with a hyphen, you can
  manipulate the buffer to merge the hyphenated tokens.

 As Erik wrote it is not that simple, unfortunately. For example, if
 one line ends with "read-" and the next line begins with "only", the
 correct word is "read-only", not "readonly". Whereas "work-" and "ing"
 should of course be merged into "working".

 Markus

Perhaps you could do some crude checking against a dictionary. Combine the word
anyway and check if it's in the dictionary. If so, keep it merged; otherwise,
it's a compound and so revert to the hyphenated form.
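
A minimal sketch of that check, assuming the word list is a plain-text file
with one word per line (the class and method names are my own invention):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class HyphenMerger {

    private final Set dictionary = new HashSet();

    public HyphenMerger(String wordListPath) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(wordListPath));
        String word;
        while ((word = in.readLine()) != null) {
            dictionary.add(word.trim().toLowerCase());
        }
        in.close();
    }

    // firstPart still carries its trailing hyphen, e.g. "work-" or "read-"
    public String join(String firstPart, String secondPart) {
        String merged = firstPart.substring(0, firstPart.length() - 1) + secondPart;
        if (dictionary.contains(merged.toLowerCase())) {
            return merged;                  // "work-" + "ing"  -> "working"
        }
        return firstPart + secondPart;      // "read-" + "only" -> "read-only"
    }
}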

Word lists come as part of all good OSS dictionary projects, and are also
available from other language resources, like the BNC word lists etc.

Andy




Re: Indexing multiple languages

2005-06-03 Thread Andy Roberts
On Friday 03 Jun 2005 01:06, Bob Cheung wrote:
 For the StandardAnalyzer, will it have to be modified to accept
 different character encodings?

 We have customers in China, Taiwan and Hong Kong.  Chinese data may come
 in 3 different encodings:  Big5, GB and UTF8.

 What is the default encoding for the StandardAnalyzer?

The analysers themselves do not worry about encodings, per se. Java uses
Unicode strings throughout, which is adequate for describing all
languages.  When reading in text files, it's a matter of letting the reader
know which encoding the file is in; this lets Java read in the text and
essentially map that encoding onto Unicode. All the string
operations, like analysing, are done on these Unicode strings.

So, the task is making sure the file reader you use to open a document for
indexing is given the required information for correctly decoding your file.
If you don't specify an encoding, Java will use a default based on the locale
that your OS uses. For me, that's Latin1 as I'm in Britain. This is clearly
inadequate for non-Latin texts and wouldn't read in Chinese text properly, as
the Latin1 encoding doesn't support such characters. You need to specify Big5
yourself. Read the info on InputStreamReader:

http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStreamReader.html
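
For example, a minimal sketch of opening a Big5 file for indexing (the file
name is just an illustration; "GB2312" or "UTF-8" would go in the same place):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

// Java decodes the bytes into Unicode as it reads, so the analyzer only ever
// sees Unicode strings.
BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("doc_big5.txt"), "Big5"));
// hand 'reader' (or the text you read from it) to your Document fields as usual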

Andy


 Btw, I did try running the lucene demo (web template) to index the HTML
 files after I added one including English and Chinese characters.  I was
 not able to search for any Chinese in that HTML file (returned no hits).
 I wonder whether I need to change some of the java programs to index
 Chinese and/or accept Chinese as search term.  I was able to search for
 the HTML file if I used English word that appeared in the added HTML
 file.

 Thanks,

 Bob


 On May 31, 2005, Erik wrote:

 Jian - have you tried Lucene's StandardAnalyzer with Chinese?  It
 will keep English as-is (removing stop words, lowercasing, and such)
 and separate CJK characters into separate tokens also.

  Erik

 On May 31, 2005, at 5:49 PM, jian chen wrote:
  Hi,
 
  Interesting topic. I thought about this as well. I wanted to index
  Chinese text with English, i.e., I want to treat the English text
  inside Chinese text as English tokens rather than Chinese text tokens.
 
  Right now I think maybe I have to write a special analyzer that takes
  the text input and detects whether each character is an ASCII char; if it
  is, assemble the run of ASCII chars together and make it a token; if not,
  make it a Chinese word token.
 
  So, bottom line is, just one analyzer for all the text and do the
  if/else statement inside the analyzer.
 
  I would like to learn more thoughts about this!
 
  Thanks,
 
  Jian
 
  On 5/31/05, Tansley, Robert [EMAIL PROTECTED] wrote:
  Hi all,
 
  The DSpace (www.dspace.org) currently uses Lucene to index metadata
  (Dublin Core standard) and extracted full-text content of documents
  stored in it.  Now the system is being used globally, it needs to
  support multi-language indexing.
 
  I've looked through the mailing list archives etc. and it seems it's
  easy to plug in analyzers for different languages.
 
  What if we're trying to index multiple languages in the same
  site?  Is
  it best to have:
 
  1/ one index for all languages
  2/ one index for all languages, with an extra language field so
  searches
  can be constrained to a particular language
  3/ separate indices for each language?
 
  I don't fully understand the consequences in terms of performance for
  1/, but I can see that false hits could turn up where one word
  appears
  in different languages (stemming could increase the chances of this).
  Also some languages' analyzers are quite dramatically different (e.g.
  the Chinese one which just treats every character as a separate
  token/word).
 
  On the other hand, if people are searching for proper nouns in
  metadata
  (e.g. DSpace) it may be advantageous to search all languages at
  once.
 
 
  I'm also not sure of the storage and performance consequences of 2/.
 
  Approach 3/ seems like it might be the most complex from an
  implementation/code point of view.
 
  Does anyone have any thoughts or recommendations on this?
 
  Many thanks,
 
   Robert Tansley / Digital Media Systems Programme / HP Labs
http://www.hpl.hp.com/personal/Robert_Tansley/
 

Re: Best way to purposely corrupt an index?

2005-04-21 Thread Andy Roberts
On Wednesday 20 Apr 2005 12:52, Kevin L. Cobb wrote:
 My policy on this type of exception handling is to only bite off what
 you can chew. If you catch an IOException, then you simply report to the
 user that an unexpected error has occurred and the search engine is
 unobtainable at the moment. Errors should be logged and developers
 should look at the specifics of the error to solve the issue. As you
 implied, either it's a corrupted index, a permission problem, or another
 access problem.


Of course, you are making the assumption that Lucene is only used in the 
context of online search engines. This is not the case here. I have developed 
a stand-alone application for text analysis, and I bundle the Lucene jar with
it to store text in an efficient index. Once the software is on the users' 
computer, I don't want to be doing any maintenance of their indexes! (And I'm 
sure they'd prefer it that way too)

Andy




Best way to purposely corrupt an index?

2005-04-19 Thread Andy Roberts
Hi,

Seems like an odd request, I'm sure. However, my application relies on an index,
and should the index become unusable for some unfortunate reason, I'd like my
app to cope gracefully with this situation.

Firstly, I need to know how to detect a broken index. Opening an IndexReader
can potentially throw an IOException if a problem occurs, but presumably this
will be thrown for other reasons too, not just an unreadable index. Would
IndexReader.indexExists() be better?

Secondly, to test how my code responds to broken indexes, I'd like to 
purposely break an index. Any suggestions, or will removing any file from the 
directory be sufficient?
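
For the first point, a minimal sketch of the kind of check I have in mind,
assuming IndexReader.indexExists() plus treating any IOException on open as
"unusable":

import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.IndexReader;

public static boolean indexIsUsable(String path) {
    // nothing resembling an index (no segments file etc.) at that location
    if (!IndexReader.indexExists(new File(path))) {
        return false;
    }
    try {
        IndexReader reader = IndexReader.open(path);
        reader.close();
        return true;
    } catch (IOException e) {
        // could be a corrupted index, a permissions problem, ...
        return false;
    }
}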

Many thanks,
Andy




Re: getting the number of occurrences within a document

2005-04-14 Thread Andy Roberts
On Thursday 14 Apr 2005 15:15, Pablo Gomes Ludermir wrote:
 Hello all,

 I would like to get the following information from the index:

 1. Given a term, how many times the term occurs in each document.
 Something like a triple:
  <Term, Doc1, Freq>, <Term, Doc2, Freq>, <Term2, Docx, Freq>, ...

 Is it possible to do that?


 Regards,
 Pablo

Off the top of my head... assuming you have an IndexReader (or MultiReader)
called reader:

TermEnum te = reader.terms();

while (te.next()) {
    Term currentTerm = te.term();

    // every document containing the current term, with the term's frequency
    TermDocs docs = reader.termDocs(currentTerm);
    int docCounter = 1;
    while (docs.next()) {
        // docs.doc() would give the actual Lucene document number; here the
        // documents are simply numbered in the order they are returned
        System.out.println(currentTerm.text() + ", doc " + docCounter + ", "
                + docs.freq());
        docCounter++;
    }
}

HTH,

Andy




Re: Terms Postion from Hits ...

2005-04-11 Thread Andy Roberts
I've managed something like this from a slightly different perspective.

IndexReader ir = IndexReader.open(yourIndex);

String searchTerm = "word";

// positions of searchTerm within the "contents" field
TermPositions tp = ir.termPositions(new Term("contents", searchTerm));

tp.next();  // move to the first document containing the term
int termFreq = tp.freq();
System.out.print(searchTerm);

for (int i = 0; i < termFreq; i++) {
    System.out.print(" " + tp.nextPosition());
}
System.out.println();

ir.close();

This will print out something like:

word 1 67 104 155

Where the term "word" occurs at positions 1, 67, 104 and 155 in the field
"contents" of the index ir.

HTH,
Andy Roberts

On Sunday 10 Apr 2005 15:52, Patricio Galeas wrote:
 Hello,
 I am new to Lucene. I have the following problem.
 When I execute a search I receive the list of document Hits.
 I can get the content of the documents without problem too:

 for (int i = 0; i < hits.length(); i++) {
   Document doc = hits.doc(i);
   System.out.println(doc.get("content"));
 }

 Now, I would like to obtain the list of all terms (and their corresponding
 positions) from each document (hits.doc(i)).

 I have experimented with creating a second index from the found documents
 (Hits) and analysing it to obtain this information, but the algorithm works
 very slowly.

 Do you have another idea?

 Thank You for your help!
