RE: Score

2007-02-01 Thread DECAFFMEYER MATHIEU
Thank u Chris for your support. 

__
Matt



-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 01, 2007 12:54 AM
To: java-user@lucene.apache.org
Subject: RE: Score

*  This message comes from the Internet Network *


: Have you looked at the constructor for BooleanQuery and
: tried passing true to disable the Coord factor?
:
: Thanks Chris, this is exactly what I want,
: but I am working with lucene 1.4.3 because I have to for some reasons,
:
: Is there any equivalent ?!

if you look at the source for it, it's fairly trivial ... you should be
able to put the same logic into a simple little helper function you use
when making BooleanQueries.




-Hoss
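A sketch of the kind of helper Hoss describes, not from the thread itself: assuming Query.getSimilarity(Searcher) is overridable in Lucene 1.4.3 (as it is in later versions), a BooleanQuery subclass can neutralize the coord factor the same way the 2.x disableCoord constructor does. Everything below is illustrative, not code from the thread.

```java
// Hypothetical helper for Lucene 1.4.3 -- approximates the 2.x
// BooleanQuery(disableCoord=true) behavior by handing the scorer a
// Similarity whose coord() is always 1.
public static BooleanQuery makeCoordlessBooleanQuery() {
    return new BooleanQuery() {
        public Similarity getSimilarity(Searcher searcher) {
            return new DefaultSimilarity() {
                public float coord(int overlap, int maxOverlap) {
                    return 1.0f; // disable the coord factor
                }
            };
        }
    };
}
```

An alternative, if subclassing is awkward, is Searcher.setSimilarity with the same coord-overriding DefaultSimilarity, at the cost of affecting every query run through that searcher.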


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Internet communications are not secure and therefore Fortis Banque Luxembourg 
S.A. does not accept legal responsibility for the contents of this message. The 
information contained in this e-mail is confidential and may be legally 
privileged. It is intended solely for the addressee. If you are not the 
intended recipient, any disclosure, copying, distribution or any action taken 
or omitted to be taken in reliance on it, is prohibited and may be unlawful. 
Nothing in the message is capable or intended to create any legally binding 
obligations on either party and it is not intended to provide legal advice.






Re: Boost/Scoring question

2007-02-01 Thread Antony Bowesman

Hi Chris,


: If I search for a document where the field boost is 0.0 then the document is
: not found if I just search that field.  Is this expected???

you mean you search on:   A^0   and get no results even though
documents contain A, and if you search on:   +A^0  B^1   you see
those documents?


It's the index time boost, rather than query time boost.  This short example 
shows the behaviour of searches for


+A
+A +B
+B

where A was indexed with boost 0.0 and B with 1.0


IndexWriter writer = new IndexWriter(TestTools.getRoot(),
                                     new StandardAnalyzer(), true);
Field f1 = new Field("subject", "subject - boost factor 0.0F",
                     Field.Store.YES, Field.Index.TOKENIZED);
f1.setBoost(0.0F);
Field f2 = new Field("body", "body - boost factor 1.0F", Field.Store.YES,
                     Field.Index.TOKENIZED);
f2.setBoost(1.0F);
Document doc = new Document();
doc.add(f1);
doc.add(f2);
writer.addDocument(doc);
writer.close();

IndexSearcher searcher = new IndexSearcher(TestTools.getRoot());
QueryParser qp;
Query query;
Hits hits;
Explanation explanation = null;

//  Match on a single zero boost field
qp = new QueryParser("subject", new StandardAnalyzer());
query = qp.parse("+subject");
hits = searcher.search(query);
System.out.println("Search just subject field, no hit found with boost 0");

//  Match on both fields
qp = new QueryParser("subject", new StandardAnalyzer());
query = qp.parse("+subject:subject +body:body");
hits = searcher.search(query);
System.out.println("Search +subject +body, match found, score on hit=" +
                   hits.score(0));
explanation = searcher.explain(query, 0);
System.out.println(explanation);

//  Match on a single non-zero boost field
qp = new QueryParser("body", new StandardAnalyzer());
query = qp.parse("+body:body");
hits = searcher.search(query);
System.out.println("Search just on body field=" + hits.score(0));
explanation = searcher.explain(query, 0);
System.out.println(explanation);


if you plan on using Hits, i would suggest requiring that boosts be > 0 ..
if you want to start dealing with raw scores, then boosts can definitely
be 0.


Hits is sufficient for now, but that may change.

Thanks
Antony






Deleting document by file name

2007-02-01 Thread DECAFFMEYER MATHIEU
Hi,

I have a list of filenames like
Corporate.htm
Logistics.htm
Merchant.htm

that need to be deleted.

For now I give this list to my Search application, which reads the
index and returns the results; if the path contains one of the
filenames, I don't display that hit ... Not really proper programming
...

Is there a way to delete the document in the index instead with this
information ?

Thank u.

__

   Matt





problem with field.setboost(5.0f) on lucene 2.00

2007-02-01 Thread liquideshark

I am building a search engine using Lucene 2.0, and I am having a problem
using the field boost setBoost. Part of my code is:

doc.add(new Field("title", httpd.getTitle(), Field.Store.YES, Field.Index.TOKENIZED));
doc.getField("title").setBoost(5.0f); // === the boost won't update to 5.0,
                                      // it remains 1.0
writer.addDocument(doc);
writer.optimize();
writer.close();

but when I look in the created index, the boost on the title field is still 1.0.
Can someone help? Thanks.
-- 
View this message in context: 
http://www.nabble.com/problem-with-field.setboost%285.0f%29-on-lucene-2.00-tf3154250.html#a8746530
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Deleting document by file name

2007-02-01 Thread Erick Erickson

Believe it or not, you delete items with IndexReader <g>. You can either
delete by document ID or by Term. Be aware that currently open searchers
will still find these documents (even after they have been deleted) until
the *searcher* is closed and reopened.

Erick
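A minimal sketch of the delete-by-Term route Erick mentions; the field name "filename" and the index path are assumptions, not something from the thread (Lucene 2.0-era API):

```java
// Sketch: delete every document whose (hypothetical) "filename" field
// matches one of the names, then close to flush the deletions.
IndexReader reader = IndexReader.open("/path/to/index");
int deleted = reader.deleteDocuments(new Term("filename", "Corporate.htm"));
reader.close();
// searchers opened before this point must be reopened to stop
// returning the deleted documents
```

This presumes the filename was indexed as its own untokenized field; if only a full path field exists, the Term value must match the indexed token exactly.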




Re: problem with field.setboost(5.0f) on lucene 2.00

2007-02-01 Thread Erick Erickson

I haven't played with boosts, but I suspect your ordering is wrong. You've
already added the field to the document before you set the boost. Try:

Field f = new Field(...);
f.setBoost(...);
doc.add(f);
writer.addDocument(doc);

Best
Erick





Re: problem with field.setboost(5.0f) on lucene 2.00

2007-02-01 Thread liquideshark

Yes, you are right,
but I have changed it to:

Field tiTle = new Field("title", httpd.getTitle(), Field.Store.YES, Field.Index.TOKENIZED);
tiTle.setBoost(6.1f);
doc.add(tiTle);
---
and it still doesn't make any change to the boost value. For information, I use
luke.jar to see if the value has changed.

Nice reading you again,
  Tandina











RE: Deleting document by file name

2007-02-01 Thread DECAFFMEYER MATHIEU
If I have the path of the document,
can't I find the ID?

__
   Matt






Building lucene index using 100 Gb Mobile HardDisk

2007-02-01 Thread maureen tanuwidjaja
Dear All,
  
  I was indexing 660,000 XML documents. The unoptimized index was
successfully built in about 17 hrs. This index resides in my D drive,
which has 38 GB free. That space is insufficient for optimizing the
index -- the Lucene documentation says optimization requires
substantial temporary free disk space.

  Hence, I have now put my index on the Mobile HardDisk, which has a capacity
of 100 GB, and restarted indexing.

  However, there is an exception saying:

  java.io.IOException: There is not enough space on the disk

  when I index the 120,000th document.

  It is a pretty weird thing!! I keep wondering what is happening. Does it
have anything to do with the file system used? My D drive uses the NTFS file
system whilst this Mobile HDD uses FAT32.

  Any comments/suggestions please?

  Thanks and best regards,
  Maureen

 

RE: Deleting document by file name

2007-02-01 Thread WATHELET Thomas
do something like this:

public class Index extends IndexModifier {
    ...
    public int deleteDocuments(String field, String value) throws IOException {
        return super.deleteDocuments(new Term(field, value));
    }
}

and use it like this:

index.deleteDocuments("field name", "field value");







RE: Building lucene index using 100 Gb Mobile HardDisk

2007-02-01 Thread Damien McCarthy
FAT32 imposes a lower maximum file size than NTFS: attempting to create
a file larger than 4 GB on FAT32 will throw the error you are seeing.




RE: Deleting document by file name

2007-02-01 Thread DECAFFMEYER MATHIEU
I see now :)
Thank u all for your support

__
   Matt






Re: Please Help me on Lucene

2007-02-01 Thread Christoph Pächter
I am also only a novice, but this should work for you.

One row in your table == one doc in Lucene.

I would index one row/document like this:

Document doc = new Document();

doc.add(new Field("prod_Id", ...
doc.add(new Field("prod_name", ...
...
writer.addDocument(doc);

Now check your index using Luke: http://www.getopt.org/luke/
Just start the web version quickly and look at your index. (Luke is simple and
nice to use, mostly self-explanatory.)
You can search for your value and look at the docs, so use it to check your
query...

If you search for a value, in your class you get the Hits object.

Hits hits = searcher.search(query);

Document doc = hits.doc(i); // gives you the document (rank i), which is one row

doc.get("prod_Id"); // to retrieve the field values

Cheers,
Christoph




RE: Building lucene index using 100 Gb Mobile HardDisk

2007-02-01 Thread maureen tanuwidjaja
Oh, is it? I didn't know about that... so does that mean I can't use this Mobile HDD?




 

Locking in Lucene 2.0

2007-02-01 Thread Kadlabalu, Hareesh
Hi,
I am starting to work with Lucene 2.0 and I noticed that we can no
longer create an FSDirectory using a LockFactory. 

Could someone point me to some discussion or documentation related to
locking and what has changed in terms of best practices? It appears that
the only way to build custom locking is to write my own Directory
implementation. 

Thanks 
-hareesh





Re: Locking in Lucene 2.0

2007-02-01 Thread Michael McCandless


I think you mean Lucene trunk (not 2.0)?  LockFactory hasn't been
released yet.

With the trunk, you can still instantiate an FSDirectory with your own
LockFactory using FSDirectory.getDirectory(*)?

Maybe you are referring to the boolean create argument?  That was
deprecated/removed from FSDirectory.getDirectory as part of:

http://issues.apache.org/jira/browse/LUCENE-773

Mike
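A sketch of what Mike describes, against the trunk API of the time; the paths here are hypothetical, and NativeFSLockFactory is one of the trunk's LockFactory implementations:

```java
// Trunk-era sketch: supply a custom LockFactory when opening the
// FSDirectory, instead of writing a whole Directory implementation.
FSDirectory dir = FSDirectory.getDirectory("/path/to/index",
        new NativeFSLockFactory("/path/to/lockdir"));
```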




Advices on a replacement of Lucene gap encoding scheme?

2007-02-01 Thread Thang Luong Minh

Dear all,

I am happy to send my first email to the Lucene community: after subscribing
to the mailing list I haven't actually joined the community, just stood
aside and followed many interesting threads.

As part of my school project, I intend to make some improvements to the
Lucene source code, and I need some advice on how significant my
modification work would be. What I am interested in so far is the gap encoding
scheme in Lucene, which is used in DocumentWriter.writePostings() to record
the gap positions of a term within a document. writePostings(), in turn,
calls the writeVInt() method to record the gap, which is the byte-aligned
coding scheme.

I'm thinking of replacing the byte-aligned scheme with the fixed binary
coding scheme described in the paper "Index compression using Fixed Binary
Codewords" by Vo Ngoc Anh and Alistair Moffat (the abstract can be found
here: http://www.cs.mu.oz.au/~vo/abstracts/am04:adc.html).
This scheme basically breaks the list of gaps into segments whose gaps (in
one segment) are coded in a fixed data width w (bits). The number of
gaps in each segment is recorded in a span variable s, and the pair (w,s)
forms a selector assigned to that segment. By effectively decomposing the
list, reducing the number of selectors to 16 combinations of relative data
size (vs. the previous segment) and span, and using a greedy algorithm to find
suboptimal solutions, the authors claim that they achieve better
compression effectiveness (measured in bits per pointer averaged across the
whole index) and retrieval time compared to Golomb, interpolative,
byte-aligned, and word-aligned coding schemes.

What I wonder at this time is, in the case of Lucene, how feasible it is
to implement the fixed binary scheme so that it enhances performance,
and whether there are other parts where I could also consider replacing the
gap-encoding scheme.

As I've only started playing around with Lucene recently, I hope to have your
help to understand Lucene better  ^_^

Best regards,

Luong Minh Thang
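For reference, a self-contained sketch of the byte-aligned VInt scheme the post proposes to replace: 7 data bits per byte, high bit set on continuation bytes, least-significant group first, mirroring Lucene's writeVInt. The class and method names here are illustrative, not Lucene's own classes.

```java
import java.io.ByteArrayOutputStream;

public class VIntDemo {
    // Encode like Lucene's writeVInt: low 7 bits first, high bit
    // marks "another byte follows".
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decode by accumulating 7-bit groups in the same order.
    static int readVInt(byte[] bytes) {
        int value = 0, shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        // Gaps up to 127 cost one byte; the cost grows one byte per 7 bits.
        for (int gap : new int[] {1, 127, 128, 16383, 16384}) {
            byte[] enc = writeVInt(gap);
            System.out.println(gap + " -> " + enc.length
                    + " byte(s), round-trips to " + readVInt(enc));
        }
    }
}
```

The fixed-binary scheme's claimed advantage is exactly over this baseline: a byte-aligned code always pays at least 8 bits per gap, while a (w,s) selector lets a run of small gaps share a narrower width.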


Re: Please Help me on Lucene

2007-02-01 Thread Chris Hostetter

Please, do not ever, under any circumstances at all, cross-post a
message to all of these lists -- there is absolutely no reason for it, and
doing so will most likely only make people mad and uncooperative.

if you are trying to use Java Lucene, then post your message to the java-user
list.  if you are trying to use Solr, then post your message to the
solr-user list, etc.  If you know you want to do something with Lucene,
but you aren't sure which list to mail, email [EMAIL PROTECTED] and ask


-Hoss





Advices on a replacement of Lucene gap encoding scheme?

2007-02-01 Thread Thang Luong Minh

Dear all

I am happy to send my first email to the Lucene community after some time
standing aside, following many interesting discussions.


PS: for this type of discussion, which mailing list is most appropriate for
my email?

Best regards,

Luong Minh Thang


bad queryparser bug

2007-02-01 Thread Peter Keegan

I have discovered a serious bug in QueryParser. The following query:

contents:sales && contents:marketing || contents:industrial &&
contents:sales

is parsed as:

+contents:sales +contents:marketing +contents:industrial +contents:sales

The same parsed query occurs even with parentheses:

(contents:sales && contents:marketing) || (contents:industrial &&
contents:sales)

Is there any way around this bug?

Thanks,
Peter
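Not from the thread, but a common workaround while QueryParser's operator precedence behaves this way: build the intended Boolean structure programmatically. This sketch uses the field and terms from the message with the Lucene 2.0-era BooleanQuery API.

```java
// (sales AND marketing) OR (industrial AND sales), nested explicitly
BooleanQuery left = new BooleanQuery();
left.add(new TermQuery(new Term("contents", "sales")), BooleanClause.Occur.MUST);
left.add(new TermQuery(new Term("contents", "marketing")), BooleanClause.Occur.MUST);

BooleanQuery right = new BooleanQuery();
right.add(new TermQuery(new Term("contents", "industrial")), BooleanClause.Occur.MUST);
right.add(new TermQuery(new Term("contents", "sales")), BooleanClause.Occur.MUST);

BooleanQuery query = new BooleanQuery();
query.add(left, BooleanClause.Occur.SHOULD);
query.add(right, BooleanClause.Occur.SHOULD);
```

Constructing the query by hand sidesteps the parser entirely, at the cost of not supporting free-form user syntax.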


Looking for crawler recommendations.

2007-02-01 Thread spamsucks
Has anyone integrated a crawler with Lucene that they had success with?  I
cannot use Nutch, since 60% of our searchable content is contained in a
database.  I need a hybrid between database indexing and website
crawling.  I would be crawling just one domain with a given set of
directories.

I found this list of crawlers, but nothing that quite seems to fit my needs.
One problem with a couple of the libraries that might work is that they use a
GNU license.

http://www.manageability.org/blog/stuff/open-source-web-crawlers-java/view

Thanks.






trouble with permissions?

2007-02-01 Thread Miles Efron
i seem to be having a problem analogous to this one (no answer that i see):

http://www.gossamer-threads.com/lists/lucene/java-user/32268?search_string=cannot%20overwrite;#32268

trouble is, i just put lucene on my new macbook pro and am having the
problem that if i build a large index, i get an I/O error due to
something like

java.io.IOException: Cannot overwrite: /data/reuters/indexes/reuters/deleteable.new

same code worked fine on my previous machine (still running on a G4
powerbook and a linux machine).  sometimes it has trouble writing the
segments file instead...

has anyone seen and solved this problem?  thoughts on what might be
behind it?

thanks,
-Miles


Re: trouble with permissions?

2007-02-01 Thread Michael McCandless


Are you running Windows on your macbook pro?

There are known issues like this, but only on Windows, eg:

  http://issues.apache.org/jira/browse/LUCENE-665

We believe such cases are now fixed by lockless commits, on the trunk
of Lucene (which is not yet released).  If you could try the trunk
(but beware that API, file formats, can change) and see if this still
happens that'd be great!

Mike




Re: trouble with permissions?

2007-02-01 Thread Miles Efron

Mike,

You rule.  Swapping out the nightly build seems to have fixed the  
problem... tried it on two problematic cases and both worked.


For the record, I'm running mac os 10.4.8.

Do you know if the lockless commits will be included in the next  
stable release?


Thanks so much!
-Miles

On Feb 1, 2007, at 3:33 PM, Michael McCandless wrote:


Miles Efron wrote:
i seem to be having a problem analogous to this one (no answer  
that i see):
http://www.gossamer-threads.com/lists/lucene/java-user/32268? 
search_string=cannot%20overwrite;#32268 trouble is, i just put  
lucene on my new macbook pro and am having the problem that if i  
build a large index, i get an I/O error due to something like
java.io.IOException: Cannot overwrite: /data/reuters/indexes/ 
reuters/deleteable.new
same code worked fine on my previous machine (still running on a  
G4 powerbook and a linux machine).  sometimes it has trouble  
writing the segments file instead...
has anyone seen and solved this problem?  thoughts on what might  
be behind it?


Are you running Windows on your macbook pro?

There are known issues like this, but only on Windows, eg:

  http://issues.apache.org/jira/browse/LUCENE-665

We believe such cases are now fixed by lockless commits, on the trunk
of Lucene (which is not yet released).  If you could try the trunk
(but beware that API, file formats, can change) and see if this still
happens that'd be great!

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: trouble with permissions?

2007-02-01 Thread Michael McCandless

Miles Efron wrote:

You rule.  Swapping out the nightly build seems to have fixed the 
problem... tried it on two problematic cases and both worked.


Phew!


For the record, I'm running mac os 10.4.8.


Uh-oh, I can't explain why you would hit these errors on OS X 10.4.8;
we have only seen these on Windows.

Are you sure switching to trunk has fixed it?  Lockless commits make
Lucene write-once, which works around a number of file system
quirks.  Still, it'd be good to get to your root cause.

Is the index stored on a remote (Windows CIFS) mount?  Or is it stored
on a local (Mac OS HFS+) drive?

Do you know if the lockless commits will be included in the next stable 
release?


Yes this will be included in 2.1 -- I think 2.1 will be released soon
(there's been discussions on the dev list to get the release process
started soon).

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: bad queryparser bug

2007-02-01 Thread Mark Miller
There is a ton of discussion on this if you search the lucene user list 
(QueryParser and precedence and the 'binary' operators). I have seen 
many mentions of the precedence parser still having open issues, but no 
mention of what those issues are.


Peter Keegan wrote:
OK, I see that I'm not the first to discover this behavior of 
QueryParser.

Can anyone vouch for the integrity of the PrecedenceQueryParser here:

http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/miscellaneous/src/java/org/apache/lucene/queryParser/precedence/ 



Thanks,
Peter

On 2/1/07, Peter Keegan [EMAIL PROTECTED] wrote:


Correction:

The query parser produces the correct query with the parenthesis.
But, I'm still looking for a fix for this. I could use some advice on
where to look in QueryParser to fix this.

Thanks,
Peter

On 2/1/07, Peter Keegan [EMAIL PROTECTED] wrote:

 I have discovered a serious bug in QueryParser. The following query:
 contents:sales && contents:marketing || contents:industrial &&
 contents:sales

 is parsed as:
 +contents:sales +contents:marketing +contents:industrial 
+contents:sales



 The same parsed query occurs even with parenthesis:
 (contents:sales && contents:marketing) || (contents:industrial &&
 contents:sales)

 Is there any way around this bug?

 Thanks,
 Peter







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: trouble with permissions?

2007-02-01 Thread Miles Efron
I really don't know why os x could have induced those kinds of  
filesystem issues.  i assumed that since i had switched over to the
intel architecture, perhaps something was going on with the
JVM...everything involved in the process was mac; local filesystem, etc.


but i'm fairly sure that the trunk code has fixed the problem.  i ran  
two 'offending' bits of code and checked their results.  not only did  
they finish (quite a feat today), but they did so correctly.


-Miles

On Feb 1, 2007, at 4:19 PM, Michael McCandless wrote:


Miles Efron wrote:

You rule.  Swapping out the nightly build seems to have fixed the  
problem... tried it on two problematic cases and both worked.


Phew!


For the record, I'm running mac os 10.4.8.


Uh-oh, I can't explain why you would hit these errors on OS X 10.4.8;
we have only seen these on Windows.

Are you sure switching to trunk has fixed it?  Lockless commits make
Lucene write-once, which works around a number of file system
quirks.  Still, it'd be good to get to your root cause.

Is the index stored on a remote (Windows CIFS) mount?  Or is it stored
on a local (Mac OS HFS+) drive?

Do you know if the lockless commits will be included in the next  
stable release?


Yes this will be included in 2.1 -- I think 2.1 will be released soon
(there's been discussions on the dev list to get the release process
started soon).

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: trouble with permissions?

2007-02-01 Thread Michael McCandless

Miles Efron wrote:
I really don't know why os x could have induced those kinds of 
filesystem issues.  i assumed that since i had switched over to the 
intel architecture, perhaps something was going on with the 
JVM...everything involved in the process was mac; local filesystem, etc.


but i'm fairly sure that the trunk code has fixed the problem.  i ran 
two 'offending' bits of code and checked their results.  not only did 
they finish (quite a feat today), but they did so correctly.


OK I will keep my fingers crossed that there isn't another issue
lurking :)

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



searching by field's TF vector (not MoreLikeThis)

2007-02-01 Thread Brian Whitman
I'm looking for a way to search by a field's internal TF vector  
representation.


MoreLikeThis does not seem to be what I want-- it constructs a text  
query based on the top scoring TF-IDF terms. I want to query by TF  
vector directly, bypassing the tokens.


Lucene understandably has knowledge of the cosine distance of these
vectors -- does it expose it in such a way that I can query the top-N
results from a field's TF vector?
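The computation the poster is after -- cosine similarity between two raw term-frequency vectors, of the kind Lucene 2.x exposes via IndexReader.getTermFreqVector -- can be done outside the scoring machinery. A minimal self-contained sketch (the Map-based representation is an assumption for illustration, not a Lucene API):

```java
import java.util.Map;

public class CosineDemo {
    // Cosine similarity between two term-frequency vectors, each represented
    // as a map from term to raw frequency. Terms absent from a map have tf 0.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * (double) other;  // shared terms
            normA += e.getValue() * (double) e.getValue();
        }
        for (int v : b.values()) normB += v * (double) v;
        if (normA == 0 || normB == 0) return 0.0;  // empty vector: define sim as 0
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

In practice one would populate the maps from TermFreqVector.getTerms()/getTermFrequencies() for each stored field, then rank documents by this value to get the top-N.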


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lucene Javadoc Exception - cause?

2007-02-01 Thread Josh Joy
Hi,

I was implementing some calls to Lucene, though was
curious if there was 
some documentation I was missing that indicated why a
method throws an 
exception.

Example, IndexReader - deleteDocuments() - what is the
root cause as to 
why it throws IOException?

I'm trying to utilize this info to determine my
exception handling 
strategy for all my Lucene API calls (should I fail,
retry, ignore, etc)

Thanks,
Josh

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: bad queryparser bug

2007-02-01 Thread Peter Keegan

Correction:

The query parser produces the correct query with the parenthesis.
But, I'm still looking for a fix for this. I could use some advice on where
to look in QueryParser to fix this.

Thanks,
Peter

On 2/1/07, Peter Keegan [EMAIL PROTECTED] wrote:


I have discovered a serious bug in QueryParser. The following query:
contents:sales && contents:marketing || contents:industrial &&
contents:sales

is parsed as:
+contents:sales +contents:marketing +contents:industrial +contents:sales

The same parsed query occurs even with parenthesis:
(contents:sales && contents:marketing) || (contents:industrial &&
contents:sales)

Is there any way around this bug?

Thanks,
Peter




Re: bad queryparser bug

2007-02-01 Thread Peter Keegan

OK, I see that I'm not the first to discover this behavior of QueryParser.
Can anyone vouch for the integrity of the PrecedenceQueryParser here:

http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/miscellaneous/src/java/org/apache/lucene/queryParser/precedence/

Thanks,
Peter

On 2/1/07, Peter Keegan [EMAIL PROTECTED] wrote:


Correction:

The query parser produces the correct query with the parenthesis.
But, I'm still looking for a fix for this. I could use some advice on
where to look in QueryParser to fix this.

Thanks,
Peter

On 2/1/07, Peter Keegan [EMAIL PROTECTED] wrote:

 I have discovered a serious bug in QueryParser. The following query:
 contents:sales && contents:marketing || contents:industrial &&
 contents:sales

 is parsed as:
 +contents:sales +contents:marketing +contents:industrial +contents:sales


 The same parsed query occurs even with parenthesis:
 (contents:sales && contents:marketing) || (contents:industrial &&
 contents:sales)

 Is there any way around this bug?

 Thanks,
 Peter





Re: Lucene Javadoc Exception - cause?

2007-02-01 Thread Erick Erickson

Well, in the normal course of events, things like deleteDocuments(Term)
shouldn't throw an exception unless I've screwed up. In my experience,
Lucene usually gracefully handles normal error cases. In this case, there
not being any underlying documents that match on Term is, I believe, handled
by just returning 0 for the documents deleted.

If the underlying Directory is closed, say, or the index is corrupted, you
might get an error thrown. Neither would be safe to ignore.

Except for a single case in my experience, exceptions are thrown by Lucene
because of a failure that retrying won't solve and ignoring would be a bad
idea. Good programming practice precludes throwing exceptions for
*recoverable* problems.

The only case I can remember where Lucene threw what I thought was an
inappropriate exception was calling a WildcardEnum with a term that had no
wildcard, and that's since been fixed in the trunk.

catching and ignoring errors is not something I'd recommend unless there is
*good* reason to believe that the underlying cause will fix itself. That's
much more common in, say, communications programs where the network can be
flaky than it is in programs like Lucene.
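Erick's stance -- treat a zero-deletion result as normal, but treat an IOException as non-recoverable and surface it rather than retry -- can be sketched as below. The `Index` interface and `deleteByTerm` are hypothetical stand-ins for a call like IndexReader.deleteDocuments(Term), used only so the sketch is self-contained:

```java
import java.io.IOException;

public class DeleteExample {
    // Hypothetical stand-in for the relevant slice of IndexReader.
    interface Index {
        int deleteByTerm(String field, String text) throws IOException;
    }

    static int deleteOrFail(Index index, String field, String text) {
        try {
            // Deleting zero documents (no matching Term) is a normal result,
            // not an error -- Lucene just reports 0 deleted.
            return index.deleteByTerm(field, text);
        } catch (IOException e) {
            // A closed Directory or corrupt index won't heal on retry:
            // fail loudly instead of ignoring or looping.
            throw new RuntimeException(
                "delete failed for " + field + ":" + text, e);
        }
    }
}
```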

Best
Erick



On 2/1/07, Josh Joy [EMAIL PROTECTED] wrote:


Hi,

I was implementing some calls to Lucene, though was
curious if there was
some documentation I was missing that indicated why a
method throws an
exception.

Example, IndexReader - deleteDocuments() - what is the
root cause as to
why it throws IOException?

I'm trying to utilize this info to determine my
exception handling
strategy for all my Lucene API calls (should I fail,
retry, ignore, etc)

Thanks,
Josh

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Use of only a prohibit search

2007-02-01 Thread Chris Hostetter

Adding a MatchAllDocsQuery instance to your boolean query if all clauses
are prohibited is in fact still the best way to do a purely negative
query.

the trunk makes this easier by adding MatchAllDocsQuery syntax to the
query parser...

*:* -description:plot



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: bad queryparser bug

2007-02-01 Thread Chris Hostetter

please do not cross post questions about using the Lucene API to both the
user and dev mailing lists -- the user list is the correct place to ask
questions about behavior you are seeing that you think may be a bug.




-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: bad queryparser bug

2007-02-01 Thread Chris Hostetter

: The query parser produces the correct query with the parenthesis.
: But, I'm still looking for a fix for this. I could use some advice on where
: to look in QueryParser to fix this.

the best advice i can give you: don't use the binary operators.

  * Lucene is not a boolean logic system
  * BooleanQuery does not implement boolean logic
  * QueryParser is not a boolean language parser

(If i could go back in time and stop the AND/OR/NOT/&&/|| aliases from
being added to the QueryParser -- i would)
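For contrast, the grouping the poster expected -- && binding tighter than ||, so "a && b || c && d" reads as (a && b) || (c && d) -- is ordinary operator precedence. A toy recursive-descent sketch of just that rule (an illustration only; this is not the code of Lucene's QueryParser or PrecedenceQueryParser):

```java
// Toy parser over whitespace-separated tokens: "&&" binds tighter than "||".
public class PrecedenceDemo {
    private final String[] tokens;
    private int pos = 0;

    PrecedenceDemo(String query) {
        tokens = query.trim().split("\\s+");
    }

    // orExpr := andExpr ("||" andExpr)*   -- lowest precedence
    String parseOr() {
        String left = parseAnd();
        while (pos < tokens.length && tokens[pos].equals("||")) {
            pos++;
            left = "(" + left + " OR " + parseAnd() + ")";
        }
        return left;
    }

    // andExpr := term ("&&" term)*        -- binds tighter than "||"
    String parseAnd() {
        String left = tokens[pos++];
        while (pos < tokens.length && tokens[pos].equals("&&")) {
            pos++;
            left = "(" + left + " AND " + tokens[pos++] + ")";
        }
        return left;
    }
}
```

Feeding it the query from this thread yields ((sales AND marketing) OR (industrial AND sales)), whereas the stock QueryParser flattens the same input to four required clauses.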



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Boost/Scoring question

2007-02-01 Thread Chris Hostetter

: It's the index time boost, rather than query time boost.  This short example
: shows the behaviour of searches for

A... index boosts! ... totally didn't occur to me that was what you
were talking about.  Yes: it makes sense that if you give a field an index
boost of 0.0f you won't be able to query with Hits on that field for that
doc: with a field boost of 0 your fieldNorm is going to be 0, which is
going to make any score from that field a 0.
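The arithmetic behind this can be sketched in a few lines. The lengthNorm below mimics DefaultSimilarity's 1/sqrt(numTerms); the real code path also quantizes the norm to a byte via Similarity.encodeNorm, and the document boost is omitted for brevity -- so this is a simplified illustration, not Lucene's actual scoring code:

```java
public class ZeroBoostDemo {
    // Simplified stand-in for DefaultSimilarity.lengthNorm(field, numTerms).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // fieldNorm = fieldBoost * lengthNorm (document boost omitted here).
    static float fieldNorm(float fieldBoost, int numTerms) {
        return fieldBoost * lengthNorm(numTerms);
    }

    // The norm multiplies into the rest of the score,
    // so norm == 0 forces score == 0 regardless of tf-idf.
    static float score(float tfIdfWeight, float norm) {
        return tfIdfWeight * norm;
    }
}
```

With a field boost of 1.0f and four terms, score(2.5f, fieldNorm(1.0f, 4)) is nonzero; with a boost of 0.0f the same document scores exactly 0 on that field, which is why it never shows up in Hits.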




-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: problem with field.setboost(5.0f) on lucene 2.00

2007-02-01 Thread Chris Hostetter

: it still doesn't make any change to the boost value; for information, i use
: luke.jar to see if the value has changed

i'm not sure what you mean you're using luke to see if the value has
changed ... boosts aren't stored in the index (they are used to compute a
fieldNorm) so there's nothing for luke to show you (but i haven't played
with luke extensively so i'm not sure what you might be looking at that
relates to boosts)

please note the javadocs for setBoost and getBoost...

public void setBoost(float boost)

    Sets the boost factor for hits on this field. This value will be
multiplied into the score of all hits on this field of this document.

The boost is multiplied by Document.getBoost() of the document
containing this field. If a document has multiple fields with the same
name, all such values are multiplied together. This product is then
multiplied by the value Similarity.lengthNorm(String,int), and rounded by
Similarity.encodeNorm(float) before it is stored in the index. One should
attempt to ensure that this product does not overflow the range of that
encoding.

public float getBoost()

Returns the boost factor for hits for this field.

The default value is 1.0.

Note: this value is not stored directly with the document in the
index. Documents returned from IndexReader.document(int) and Hits.doc(int)
may thus not have the same value present as when this field was indexed.



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]