Frequency of phrase

2006-02-23 Thread Eric Jain
This is somewhat related to a question sent to this list a while ago: Is 
there an efficient way to count the number of occurrences of a phrase (not 
term) in an index?


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How can I get a term's frequency?

2006-02-23 Thread Grant Ingersoll
You need to make sure you are indexing with Term Vectors in order for
IndexReader.getTermFreqVector to return anything meaningful. You do not
need to implement it.

QueryTermVector is meant to provide similar information to the Document
side for Queries.
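
As a rough illustration of what a term frequency vector holds, the sketch below builds one for a single field in plain Java. It is not Lucene code: the class name `TermFrequencies` and the tokenization (lowercase, split on non-word characters) are assumptions made for this example, and the parallel `terms` and `freqs` arrays only mirror the general shape of the data.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.TreeMap;

// Toy term frequency vector for one field of one document (not Lucene's API).
final class TermFrequencies {
    final String[] terms; // distinct terms in sorted order
    final int[] freqs;    // freqs[i] is how often terms[i] occurs in the field

    TermFrequencies(String fieldText) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        // Assumed analysis: lowercase, then split on runs of non-word characters.
        for (String token : fieldText.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        terms = counts.keySet().toArray(new String[0]);
        freqs = new int[terms.length];
        for (int i = 0; i < terms.length; i++) {
            freqs[i] = counts.get(terms[i]);
        }
    }

    // Frequency of a term in this field, or 0 if the term does not occur.
    int freq(String term) {
        int i = Arrays.binarySearch(terms, term);
        return i < 0 ? 0 : freqs[i];
    }

    public static void main(String[] args) {
        TermFrequencies vector = new TermFrequencies("The quick fox follows the slow fox");
        System.out.println(Arrays.toString(vector.terms));
        System.out.println(Arrays.toString(vector.freqs));
    }
}
```

The point of the sketch is only that the frequencies are computed per document and per field, which is why the index has to store them at indexing time before they can be read back.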

For an example demo of indexing and using term vectors, go to
http://www.cnlp.org/apachecon2005. All the examples are under Apache
license and there is some documentation too.

-Grant

Daniel Noll wrote:
 sog wrote:
   
 Hmm, but IndexReader.getTermFreqVector is an abstract method, and I do not
 know how to implement it in an efficient way. Does anyone have good advice?
 

 You probably don't need to implement it, it's been implemented already.
  Just call the method.

   
 I can do it in this way:

 QueryTermVector vector = new QueryTermVector(doc.getValues(field)); // doc: a Document instance
 int[] freq = vector.getTermFrequencies();
 

 I'm not sure because I've never used QueryTermVector before, but the
 fact that QueryTermVector doesn't take an IndexReader as a parameter is
 a good indication that it can't tell you anything about the frequency of
 the term in your documents.

 Daniel




   

-- 
--- 
Grant Ingersoll 
Sr. Software Engineer 
Center for Natural Language Processing 
Syracuse University 
School of Information Studies 
335 Hinds Hall 
Syracuse, NY 13244 

http://www.cnlp.org 
Voice:  315-443-5484 
Fax: 315-443-6886 





Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Raghavendra Prabhu
Can nutch be made to use lucene query parser?

Rgds
Prabhu


On 2/23/06, Peter Keegan [EMAIL PROTECTED] wrote:

 Hi Otis,

 The Lucene server is actually CPU and network bound, as the index gets
 memory mapped pretty quickly. There is little disk activity observed.

 I was also able to run the server on a Sun box last night with 4 dual core
 opterons (same Linux and JVM) and I'm observing query rates of 400 qps!
 Has Linux been optimized to run on this hardware? I imagine that Sun's JVM
 has been.

 Peter

 On 2/22/06, Otis Gospodnetic [EMAIL PROTECTED] wrote:
 
  Hi,
 
  Some things that could be different:
  - thread scheduling (shouldn't make too much of a difference though)
 
  --- I would also play with disk IO schedulers, if you can.  CentOS is
  based on RedHat, I believe, and RedHat (ext3, really) now has about 4
  different IO schedulers that, according to articles I recently read, can
  have an impact on disk read/write performance.  These schedulers can be
  specified at mount time, I believe, and maybe at boot time (kernel line
  in Grub/LILO).
 
  Otis
 
 
  On 2/22/06, Peter Keegan [EMAIL PROTECTED] wrote:
   I am doing a performance comparison of Lucene on Linux vs Windows.
  
   I have 2 identically configured servers (8-CPUs (real) x 3GHz Xeon
   processors, 64GB RAM). One is running CentOS 4 Linux, the other is
   running Windows server 2003 Enterprise Edition x64. Both have 64-bit
   JVMs from Sun.
   The Lucene server is using MMapDirectory. I'm running the jvm with
   -Xmx16000M. Peak memory usage of the jvm on Linux is about 6GB and
   7.8GB on windows.

   I'm observing query rates of 330 queries/sec on the Wintel server, but
   only 200 qps on the Linux box. At first, I suspected a network
   bottleneck, but when I 'short-circuited' Lucene, the query rates were
   identical.

   I suspect that there are some things to be tuned in Linux, but I'm not
   sure what. Any advice would be appreciated.
  
   Peter
  
  
  
   On 1/30/06, Peter Keegan [EMAIL PROTECTED] wrote:
   
I cranked up the dial on my query tester and was able to get the rate up
to 325 qps. Unfortunately, the machine died shortly thereafter (memory
errors :-( ) Hopefully, it was just a coincidence. I haven't measured
64-bit indexing speed, yet.
   
Peter
   
On 1/29/06, Daniel Noll [EMAIL PROTECTED] wrote:

 Peter Keegan wrote:
  I tried the AMD64-bit JVM from Sun and with MMapDirectory and I'm now
  getting 250 queries/sec and excellent cpu utilization (equal
  concurrency on all cpus)!! Yonik, thanks for the pointer to the 64-bit
  jvm. I wasn't aware of it.

 Wow.  That's fast.

 Out of interest, does indexing time speed up much on 64-bit hardware?
 I'm particularly interested in this side of things because for our own
 application, any query response under half a second is good enough, but
 the indexing side could always be faster. :-)

 Daniel

 --
 Daniel Noll

 Nuix Australia Pty Ltd
 Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
 Phone: (02) 9280 0699
 Fax:   (02) 9212 6902

 This message is intended only for the named recipient. If you are not
 the intended recipient you are notified that disclosing, copying,
 distributing or taking any action in reliance on the contents of this
 message or attachment is strictly prohibited.



 
 




Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Dan Armbrust
I would give the IBM or Blackdown JVM a try on Linux - I've seen pretty
wide variance in their speed on different operations.


Sometimes better than Sun, sometimes worse - it depended on the task (I
did some ad hoc tests at one point that showed Sun was faster for
indexing, but IBM was faster for querying - but that was quite a while ago).


Dan


--

Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/




SQL DISTINCT functionality in Lucene

2006-02-23 Thread Hugh Ross
Hi,

I need to find all distinct values for a keyword field in a Lucene index.

 

Is this easily done? If so how?

 

Many thanks,

Hugh



Hierarchical Navigation in Lucene

2006-02-23 Thread Hugh Ross
Hi,

We have a custom built document repository which is searchable / indexed via
lucene.

I want to put together some kind of navigation framework based on the
repository metadata (which is also indexed with lucene). 


Is there a best-practice way to do this?

 

Thanks,

Hugh

 

 



RE: SQL DISTINCT functionality in Lucene

2006-02-23 Thread Hugh Ross
Many Thanks. 
Hugh

-----Original Message-----
From: Michael D. Curtin [mailto:[EMAIL PROTECTED] 
Sent: 23 February 2006 17:39
To: java-user@lucene.apache.org
Subject: Re: SQL DISTINCT functionality in Lucene

Hugh Ross wrote:


 I need to find all distinct values for a keyword field in a Lucene index.

I think the IndexReader.terms() method will do what you want.  Good luck!

--MDC
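
To make the idea concrete, here is a minimal sketch in plain Java of why enumerating terms yields the distinct values of a field. It does not use Lucene: the class `DistinctFieldValues` and its sorted set are stand-ins invented for this example. The assumption is that the term dictionary is ordered by field and then by text, so the walk can seek to the first term of the field and stop as soon as the field changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy stand-in for an index's term dictionary; not Lucene's API.
final class DistinctFieldValues {

    // Each entry is field + '\u0000' + text, so natural String order sorts by
    // field first and then by text, grouping all terms of a field together.
    private final TreeSet<String> terms = new TreeSet<>();

    void addTerm(String field, String text) {
        terms.add(field + '\u0000' + text);
    }

    // Seek to the first term of the field and walk forward until the field changes.
    List<String> distinctValues(String field) {
        String prefix = field + '\u0000';
        List<String> values = new ArrayList<>();
        for (String entry : terms.tailSet(prefix)) {
            if (!entry.startsWith(prefix)) {
                break; // left the field: every later entry belongs to another field
            }
            values.add(entry.substring(prefix.length()));
        }
        return values;
    }

    public static void main(String[] args) {
        DistinctFieldValues index = new DistinctFieldValues();
        index.addTerm("category", "report");
        index.addTerm("category", "manual");
        index.addTerm("category", "manual"); // duplicate, stored once
        index.addTerm("author", "hugh");
        System.out.println(index.distinctValues("category"));
    }
}
```

Because each distinct value is stored once in the dictionary, the cost of the walk depends on the number of distinct values rather than on the number of documents.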




RE: search a subdirectory (New to Lucene)

2006-02-23 Thread John Hamilton
I reindexed with the path as a keyword field and now the PrefixQuery filter 
does exactly what I need.  Thanks!

I'm going to hold off on the paragraph-level indexing for now, but that does 
sound interesting.

many thanks,

John

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 22, 2006 3:18 PM
To: java-user@lucene.apache.org
Subject: Re: search a subdirectory (New to Lucene)


I presume by saying subdirectory you're referring to filesystem
directories and you're indexing a directory tree of files.  If you
index the path (perhaps relative from the root is best) as a keyword
field (untokenized, but indexed) you could perform filtering in a
/path/subpath sort of way using PrefixQuery.
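
The effect of a prefix match on an untokenized path field can be sketched in plain Java as follows. This is not Lucene code; `PathPrefixFilter` and the sample paths are made up for the illustration. It relies on the path being kept as one whole value, which is what indexing it as a keyword field gives you.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration of prefix filtering on whole path values (not Lucene's API).
final class PathPrefixFilter {

    // Keeps only the paths that lie under the given directory.
    static List<String> underDirectory(List<String> paths, String directory) {
        // A trailing slash keeps "docs/a" from also matching "docs/ab/...".
        String prefix = directory.endsWith("/") ? directory : directory + "/";
        List<String> matches = new ArrayList<>();
        for (String path : paths) {
            if (path.startsWith(prefix)) {
                matches.add(path);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "docs/a/intro.txt", "docs/ab/notes.txt", "docs/a/sub/deep.txt", "src/Main.java");
        System.out.println(underDirectory(paths, "docs/a"));
    }
}
```

If the path were tokenized instead, the value would be broken into pieces and a prefix over the whole path would no longer have anything to match against.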

As for paragraphs - how you index a document is entirely  
application dependent.  Maybe it makes sense to parse the documents  
before handing them to Lucene such that you're creating a Lucene  
Document for each paragraph rather than for each entire file.   
Slicing the granularity of a domain into Documents is a fascinating  
topic :)
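
A minimal sketch of the paragraph-per-Document idea, in plain Java: split the file text on blank lines and treat each piece as its own unit to index. The class `ParagraphSplitter` and the blank-line rule are assumptions for this example; real documents may need a format-specific parser.

```java
import java.util.ArrayList;
import java.util.List;

// Toy pre-processing step: one entry per paragraph, each to become its own document.
final class ParagraphSplitter {

    // Splits on one or more blank (or whitespace-only) lines and drops empty pieces.
    static List<String> paragraphs(String text) {
        List<String> result = new ArrayList<>();
        for (String block : text.split("\\r?\\n\\s*\\r?\\n")) {
            String trimmed = block.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String file = "First paragraph,\nstill the first.\n\nSecond paragraph.\n\n\nThird.";
        List<String> parts = paragraphs(file);
        for (int i = 0; i < parts.size(); i++) {
            // In an indexer, each part would be added together with the file path
            // and the paragraph number so a hit can be traced back to its source.
            System.out.println(i + ": " + parts.get(i));
        }
    }
}
```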

Erik


On Feb 22, 2006, at 1:00 PM, John Hamilton wrote:

 I'm new to Lucene and was wondering what is the best way to perform  
 a search on a subdirectory or subdirectories within the index?  My  
 thought at this point is to build a query to first search for files  
 in the required directory(ies) and then use that query to make a  
 QueryFilter and use that QueryFilter in the actual search.  Is  
 there an easier way?

 On an unrelated note, does anybody know of a way to get results at
 the section level within a document?  For example, could I find not
 just a document that matches my query, but the paragraph within
 that document that best matches the query?

 thanks,

 John





Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Otis Gospodnetic
Hi,

Please ask on the Nutch mailing list (I answered your question in general@ 
already).
Also, please don't steal other people's threads - it's considered impolite for
obvious reasons.

Otis





Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Raghavendra Prabhu
Hi

Sorry for the trouble

I was sending my first mail to the group

and replied to this thread and then later on sent a direct mail.

I would like to apologise for the inconvenience caused.

Rgds
Prabhu



Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Peter Keegan
We discovered that the kernel was only using 8 CPUs. After recompiling for
16 (8+hyperthreads), it looks like the query rate will settle in around
280-300 qps. Much better, although still quite a bit slower than the
opteron.

Peter




On 2/22/06, Yonik Seeley [EMAIL PROTECTED] wrote:

 Hmmm, not sure what that could be.
 You could try using the default FSDir instead of MMapDir to see if the
 differences are there.

 Some things that could be different:
 - thread scheduling (shouldn't make too much of a difference though)
 - synchronization workings
 - page replacement policy... how to figure out what pages to swap in
 and which to swap out, esp of the memory mapped files.

 You could also try a profiler on both platforms to try and see where
 the difference is.

 -Yonik




Re: Hierarchical Navigation in Lucene

2006-02-23 Thread Erik Hatcher


On Feb 23, 2006, at 12:37 PM, Hugh Ross wrote:


Hi,

We have a custom built document repository which is searchable /
indexed via lucene.

I want to put together some kind of navigation framework based on the
repository metadata (which is also indexed with lucene).


Is there a best-practice way to do this?



I don't know about a best practice, but I've used term enumeration  
coupled with PrefixQuery's to enable hierarchical navigation on my  
(very dusty and way outdated) blog: http://www.blogscene.org/erik
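
One way to picture this, as a rough sketch in plain Java rather than Lucene code: assume each document carries a slash-separated category path (a hypothetical layout, as are the class `HierarchyBrowser` and the sample paths). Enumerating the values under the current prefix gives the next level of the tree with counts; choosing an entry extends the prefix, which would then drive a prefix-style query.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy hierarchical navigation over slash-separated paths (not Lucene's API).
final class HierarchyBrowser {

    // Returns the immediate children below the prefix, with the number of
    // paths found under each child. Sub-levels keep a trailing slash.
    static SortedMap<String, Integer> children(List<String> paths, String prefix) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (String path : paths) {
            if (!path.startsWith(prefix)) {
                continue; // outside the current level
            }
            String rest = path.substring(prefix.length());
            int slash = rest.indexOf('/');
            String child = slash < 0 ? rest : rest.substring(0, slash + 1);
            if (!child.isEmpty()) {
                counts.merge(child, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "docs/api/index.html", "docs/api/search.html",
                "docs/guide/intro.html", "docs/readme.txt", "src/Main.java");
        System.out.println(children(paths, "docs/"));
    }
}
```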


Erik





Re: Searching/sorting strategy for many properties for semantic web app

2006-02-23 Thread Erik Hatcher


On Feb 22, 2006, at 9:01 PM, David Pratt wrote:
Hi Erik. Many thanks for your reply. I'll likely see if I can find
a list to pose a couple of questions their way. I am having fun
with Lucene since it is new to me and I am impressed with the speed
I am getting. I am reading anything I can get hold of and trying
different code experiments. So far, the code is fairly
straightforward, so I am not so concerned about this at the moment.


I am really hoping to hear from experienced people like yourself
more on strategically what to index, what sort of things it would
be a good idea to store, and what to do about a fairly large schema
that has much metadata to offer. Also perhaps when sorting and
filtering get too expensive. I realize that just because the
metadata is available doesn't necessarily mean you want to even put
it all in an index. I think these issues are pretty general;
however, I know there are folks on this list who would likely advise
some particular path or direction because of their own experiences
with Lucene. I would really like to hear from anyone who has been
working with metadata particularly, or anyone generally, about these
topics.


In my University job, I'm dealing with a fair bit of metadata in the  
form of RDF about 19th century literature objects.  I'm indexing  
basic Dublin Core data such as title and author as individual fields,  
and also dropping all indexed metadata into a single searchable  
field.  I've been using Kowari as the metadata store, but it also has  
Lucene integration (that I've not tried myself yet).
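
As a small sketch of the "individual fields plus one searchable field" layout, in plain Java and not tied to Lucene: the document keeps each metadata value under its own name and also gets one combined field. The field name `all`, the class `CatchAllField`, and the sample record are assumptions for the illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy document builder: individual metadata fields plus one catch-all field.
final class CatchAllField {

    // Copies the individual fields and adds a combined field, here named "all"
    // (a hypothetical name), holding every value separated by spaces.
    static Map<String, String> withCatchAll(Map<String, String> metadata) {
        Map<String, String> doc = new LinkedHashMap<>(metadata);
        doc.put("all", String.join(" ", metadata.values()));
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", "Goblin Market");
        metadata.put("author", "Christina Rossetti");
        System.out.println(withCatchAll(metadata));
    }
}
```

A general search box can then query the combined field, while fielded searches and sorting still use the individual fields.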


I'm not sure what else to add as your query is a bit general.  I  
think you'll find if you post more specific questions you're more  
likely to get detailed responses.  General queries tend to be too  
general to respond to, I find.


There really are no best practices with Lucene in terms of what to
index, what to store - these are all highly application dependent and
are often things I tune as the application itself evolves.


Erik





Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Peter Keegan
Chris,

I tried JRockit a while back on 8-cpu/windows and it was slower than Sun's.
Since I seem to be cpu-bound right now, I'll be trying a 16-cpu system next
(32 with hyperthreading), on LinTel. I may give JRockit another go around
then.

Thanks,
Peter

On 2/23/06, Chris Lamprecht [EMAIL PROTECTED] wrote:

 Peter,
 Have you given JRockit JVM a try?  I've seen it help throughput
 compared to Sun's JVM on a dual xeon/linux machine, especially with
 concurrency (up to 6 concurrent searches happening).  I'm curious to
 see if it makes a difference for you.

 -chris





Getting no hits ...

2006-02-23 Thread Mufaddal Khumri
I have been trying to figure out why my query below would not return any 
hits.


I use two custom analyzers for indexing and searching. The one I use for 
indexing uses this:


   public TokenStream tokenStream(String fieldName, Reader reader)
   {
       TokenStream result = new StandardTokenizer(reader);
       result = new StandardFilter(result);
       result = new LowerCaseFilter(result);
       result = new StopFilter(result, stopSet);
       result = new SynonymFilter(result, new MySynonymEngine());
       result = new PorterStemFilter(result);
       return result;
   }

The one I use for searching uses this:

   public TokenStream tokenStream(String fieldName, Reader reader)
   {
       TokenStream result = new StandardTokenizer(reader);
       result = new StandardFilter(result);
       result = new LowerCaseFilter(result);
       result = new StopFilter(result, stopSet);
       result = new PorterStemFilter(result);
       return result;
   }

(Basically while searching I do not use the SynonymFilter.)

I have indexed quite a few products that contain the text I am
querying for.


I do a search for this: ES-20D

This is the final query that I run:
+(+content:es\-20d) +entity:product +(title:es\-20d~2^40.0 
((title:es\-20d)^10.0) content:es\-20d~2^20.0 (content:es\-20d) 
categoryName:es\-20d^80.0)


(The content and title fields are Indexed, Tokenized and Stored. The 
categoryName field is Indexed and Stored.)


I get no hits.

Where am I going wrong with this? Any pointers?

-Thanks.








Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Yonik Seeley
Wow, some resources!
Would it be cheaper / more scalable to copy the index to multiple
boxes and loadbalance requests across them?

-Yonik

On 2/23/06, Peter Keegan [EMAIL PROTECTED] wrote:
 Since I seem to be cpu-bound right now, I'll be trying a 16-cpu system next
 (32 with hyperthreading), on LinTel. I may give JRockit another go around
 then.

 Thanks,
 Peter




Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Peter Keegan
Yonik,

We're investigating both approaches.
Yes, the resources (and permutations) are dizzying!

Peter





Re: Getting no hits ...

2006-02-23 Thread Chris Hostetter

1) Have you looked at what tokens your indexing analyzer produces when you
   tokenize ES-20D ?
2) Have you looked at what tokens your query analyzer produces when you
   tokenize ES-20D ?
3) Have you tried a simpler query (i.e., just content:es\-20d ) ?
4) When giving QueryParser a (quoted) phrase search, I don't think you
   really want to escape that - character.
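
To illustrate point 4, here is a minimal plain-Java sketch of the difference between escaping a bare term and building a quoted phrase clause. It does not call Lucene at all: QueryEscapeSketch, escapeTerm, termClause and phraseClause are hypothetical names, and the list of special characters is an assumption based on the QueryParser syntax documentation, so check it against the version you run.

```java
// Hypothetical helper, not part of Lucene: builds query-string clauses by hand.
final class QueryEscapeSketch {
    // Characters assumed to be special in the QueryParser syntax.
    private static final String SPECIAL = "+-!(){}[]^\"~*?:\\";

    // Backslash-escape every special character in a bare (unquoted) term.
    static String escapeTerm(String raw) {
        StringBuilder sb = new StringBuilder();
        for (char c : raw.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Bare term clause: the '-' must be escaped so it is not read as NOT.
    static String termClause(String field, String raw) {
        return field + ":" + escapeTerm(raw);
    }

    // Quoted phrase clause: the text goes inside quotes unescaped
    // (only an embedded double quote is protected), followed by the slop.
    static String phraseClause(String field, String raw, int slop) {
        String body = raw.replace("\"", "\\\"");
        return field + ":\"" + body + "\"~" + slop;
    }

    public static void main(String[] args) {
        System.out.println(termClause("content", "es-20d"));     // prints content:es\-20d
        System.out.println(phraseClause("title", "es-20d", 2));  // prints title:"es-20d"~2
    }
}
```

If the phrase clauses in the query were built with the escaped form inside the quotes, the backslash may end up in the text handed to the analyzer, which would be one possible reason for the mismatch.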



: Date: Thu, 23 Feb 2006 14:16:42 -0700
: From: Mufaddal Khumri [EMAIL PROTECTED]
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Getting no hits ...
:
: I have been trying to figure out why my query below would not return any
: hits.
:
: I use two custom analyzers for indexing and searching. The one I use for
: indexing uses this:
:
: public TokenStream tokenStream(String fieldName, Reader reader)
: {
: TokenStream result = new StandardTokenizer(reader);
: result = new StandardFilter(result);
: result = new LowerCaseFilter(result);
: result = new StopFilter(result, stopSet);
: result = new SynonymFilter(result, new MySynonymEngine());
: result = new PorterStemFilter(result);
: return result;
: }
:
: The one I use for searching uses this:
:
: public TokenStream tokenStream(String fieldName, Reader reader)
: {
: TokenStream result = new StandardTokenizer(reader);
: result = new StandardFilter(result);
: result = new LowerCaseFilter(result);
: result = new StopFilter(result, stopSet);
: result = new PorterStemFilter(result);
: return result;
: }
:
: (Basically while searching I do not use the SynonymFilter.)
:
: I have quite a few products that I index that have the text on which I
: am querying on.
:
: I do a search for this: ES-20D
:
: This is the final query that I run:
: +(+content:es\-20d) +entity:product +(title:es\-20d~2^40.0
: ((title:es\-20d)^10.0) content:es\-20d~2^20.0 (content:es\-20d)
: categoryName:es\-20d^80.0)
:
: (The content and title fields are Indexed, Tokenized and Stored. The
: categoryName field is Indexed and Stored.)
:
: I get no hits?
:
: Where am i going wrong with this? Any pointers?
:
: -Thanks.
:
:
:
:
:
:



-Hoss





Re: Getting no hits ...

2006-02-23 Thread Mufaddal Khumri
In my earlier email I gave the wrong search term. The correct term is:
EOS-20D


And this is the query in question, which still produces no hits:

+(+content:eos\-20d) +entity:product +(title:eos\-20d~2^40.0 
((title:eos\-20d)^10.0) content:eos\-20d~2^20.0 (content:eos\-20d) 
categoryName:eos\-20d^80.0)


I used AnalyzerUtils.displayTokensWithFullDetails(analyzer, string) to 
inspect the tokens (AnalyzerUtils is from the Lucene in Action book).


This is part of the log output from using the 
AnalyzerUtils.displayTokensWithFullDetails(analyzer, string) when this 
product gets indexed:



119: [013803044430:857-869:ALPHANUM]
120: [eos-20d:870-877:NUM]
121: [011-eos-20d:878-889:NUM]

This is part of the log output from using the 
AnalyzerUtils.displayTokensWithFullDetails(analyzer, string) when I do 
the search:

1: [eos-20d:0-6:NUM]

As far as I can tell, the analyzers produce the same token (eos-20d) 
at index time and at search time.





Re: ArrayIndexOutOfBounds being thrown ...

2006-02-23 Thread Stephen Gray

Hi everyone,

Sorry for not replying to the original post (from Mufaddal Khumri, 22/2) - I'm 
new to the list.


I also had this problem, but it does not seem to be in the source: downloading 
and building the 1.9-rc1 source fixed the problem for me.


Steve


Stephen Gray
Archive Research Officer
Australian Social Science Data Archive
18 Balmain Crescent (Building #66)
The Australian National University
Canberra ACT 0200

Phone +61 2 6125 2185
Fax +61 2 6125 0627
Web http://assda.anu.edu.au/


Re: Getting no hits ...

2006-02-23 Thread Mufaddal Khumri

Follow up on my previous email ...

When I run this query in Luke with the StandardAnalyzer against the 
same index, I get 8 hits:
+(+content:eos\-20d) +entity:product +(title:eos\-20d~2^40.0 
((title:eos\-20d)^10.0) content:eos\-20d~2^20.0 (content:eos\-20d) 
categoryName:eos\-20d^80.0)


I modified my search code to use the StandardAnalyzer, but I still did not 
get any hits back. I am still trying to figure the problem out. Any 
ideas?





phrase frequency??

2006-02-23 Thread sog


I searched the mail archive for my question and found that what I really want 
is a phrase frequency. It is an old question that was never resolved well.


I traced the Lucene source code and discovered that I can get a phrase's IDF 
from the Hits object:


weight= PhraseQuery$PhraseWeight  (id=62)
idf= 8.3973465
queryNorm= 0.11908524
queryWeight= 1.0
similarity= DefaultSimilarity  (id=66)
this$0= PhraseQuery  (id=29)
value= 8.3973465

From this we get an approximate formula: score = tf * idf

so: tf(phrase) = score / idf(phrase)


Is this correct?
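
A rough sketch of that inversion in plain Java, under stated assumptions: DefaultSimilarity is in use (its tf(freq) is sqrt(freq)), the score is the raw scorer value rather than one rescaled by Hits, and the field norm of the document is known. PhraseFreqSketch and its methods are hypothetical names; the arithmetic follows my reading of how the weight in the dump above derives queryWeight and value, so it is an approximation to verify against the source, not Lucene's own API.

```java
// Hypothetical helper, not part of Lucene: inverts an assumed scoring formula.
final class PhraseFreqSketch {
    // DefaultSimilarity's tf: square root of the raw frequency.
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // Assumed raw score of a single phrase query for one document.
    static double rawScore(double freq, double idf, double queryNorm, double fieldNorm) {
        double queryWeight = idf * queryNorm;  // about 1.0 in the dump above
        double value = queryWeight * idf;      // about 8.3973465 in the dump above
        return tf(freq) * value * fieldNorm;
    }

    // Invert the formula: recover the phrase frequency from a raw score.
    static double phraseFreq(double rawScore, double idf, double queryNorm, double fieldNorm) {
        double tfValue = rawScore / (idf * queryNorm * idf * fieldNorm);
        return tfValue * tfValue;  // undo the square root
    }

    public static void main(String[] args) {
        double idf = 8.3973465;        // from the dump above
        double queryNorm = 0.11908524; // from the dump above
        double fieldNorm = 0.5;        // made-up value for the example
        double score = rawScore(4.0, idf, queryNorm, fieldNorm);
        System.out.println("score = " + score);
        System.out.println("recovered freq = " + phraseFreq(score, idf, queryNorm, fieldNorm));
    }
}
```

Under these assumptions, score / idf equals tf only when the field norm is 1 and the score has not been normalized, and the phrase frequency itself is the square of tf rather than tf.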



- Original Message - 
From: Daniel Noll [EMAIL PROTECTED]

To: java-user@lucene.apache.org
Sent: Thursday, February 23, 2006 8:57 AM
Subject: Re: How can I get a term's frequency?



sog wrote:

Yes, but IndexReader.getTermFreqVector is an abstract method, and I do not
know how to implement it in an efficient way. Does anyone have good advice?


You probably don't need to implement it, it's been implemented already.
Just call the method.


I can do it in this way:

// document is a Document instance; getValues returns the field's String values
QueryTermVector vector = new QueryTermVector(document.getValues(field));
int[] freq = vector.getTermFrequencies();


I'm not sure because I've never used QueryTermVector before, but the
fact that QueryTermVector doesn't take an IndexReader as a parameter is
a good indication that it can't tell you anything about the frequency of
the term in your documents.

Daniel




--
Daniel Noll

Nuix Australia Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Phone: (02) 9280 0699
Fax:   (02) 9212 6902
