[jira] [Commented] (LUCENE-5938) New DocIdSet implementation with random write access

2014-09-11 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129950#comment-14129950
 ] 

Eks Dev commented on LUCENE-5938:
-

Just a crazy idea: do you need to store words with all bits set? I did not look
into the implementation, but from your description it sounds like it might well
be possible to not store them without adding too many ifs to the execution path.
This way, it would also work better for dense bit sets (like an implicit
inverting trick), and for all intermediate cases where you have some partial
sorting (some sort of run-length encoding)?
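
A minimal sketch of the idea above, in Java. Nothing here is taken from the
attached patch: the class name ElidingBitSet, the HashMap of partially set
words and the per-word "full" flag are hypothetical, chosen only to show that
a 64-bit word with all bits set can be represented by a single flag instead of
being stored.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sparse bit set that stores neither all-zero nor all-one words. */
final class ElidingBitSet {
    // Words that have some, but not all, of their 64 bits set.
    private final Map<Integer, Long> mixed = new HashMap<>();
    // One flag per word index: set when all 64 bits of that word are set.
    private final BitSet full = new BitSet();

    void set(int doc) {
        int w = doc >>> 6;                       // index of the 64-bit word
        if (full.get(w)) {
            return;                              // word is already all ones
        }
        long word = mixed.getOrDefault(w, 0L) | (1L << doc); // shift uses doc & 63
        if (word == -1L) {                       // word just became all ones
            mixed.remove(w);
            full.set(w);
        } else {
            mixed.put(w, word);
        }
    }

    boolean get(int doc) {
        int w = doc >>> 6;
        if (full.get(w)) {
            return true;
        }
        Long word = mixed.get(w);
        return word != null && (word & (1L << doc)) != 0;
    }

    /** Number of 64-bit words that are physically stored. */
    int storedWords() {
        return mixed.size();
    }

    public static void main(String[] args) {
        ElidingBitSet bits = new ElidingBitSet();
        for (int doc = 0; doc < 64; doc++) {
            bits.set(doc);                       // fills word 0 completely
        }
        bits.set(70);                            // word 1 stays partially set
        System.out.println(bits.get(10) + " " + bits.get(64) + " " + bits.storedWords());
    }
}
```

The extra branch on the read path is the single full.get(w) check. Whether
that is cheap enough in a real DocIdSet is exactly the open question raised
above.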


 New DocIdSet implementation with random write access
 

 Key: LUCENE-5938
 URL: https://issues.apache.org/jira/browse/LUCENE-5938
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
 Attachments: LUCENE-5938.patch


 We have a great cost API that is supposed to help make decisions about how to 
 best execute queries. However, due to the fact that several of our filter 
 implementations (eg. TermsFilter and BooleanFilter) return FixedBitSets, 
 either we use the cost API and make bad decisions, or need to fall back to 
 heuristics which are not as good such as 
 RandomAccessFilterStrategy.useRandomAccess which decides that random access 
 should be used if the first doc in the set is less than 100.
 On the other hand, we also have some nice compressed and cacheable DocIdSet 
 implementation but we cannot make use of them because TermsFilter requires a 
 DocIdSet that has random write access, and FixedBitSet is the only DocIdSet 
 that we have that supports random access.
 I think it would be nice to replace FixedBitSet in those filters with another 
 DocIdSet that would also support random write access but would have a better 
 cost?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-09-03 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119977#comment-14119977
 ] 

Eks Dev commented on LUCENE-5914:
-

Lovely, thanks for explaining. I expected something like this but was not 100%
sure without looking into the code.
Simply put, I see absolutely nothing more one might wish for from general, OOTB
compression support...

In theory...
The only meaningful enhancements over the standard can come from modelling the
semantics of the data (the user must know quite a bit about the distribution of
the data) to improve compression/speed, but this cannot be provided by the core
(Lucene is rightly content-agnostic). At most the core APIs might make it more
or less comfortable, but IMO nothing more.

For example (contrived, as LZ4 would deal with it quite OK, just to illustrate),
if I know that my field contains up to 5 distinct string values, I might add
simple dictionary coding to use at most one byte per value without even going
to the codec level. The only place where I see a theoretical need to get down
and dirty is if I wanted to reach sub-byte representations (3 bits per value in
this example), but this is rarely needed, it is hard to beat the default
LZ4/deflate this way, and even harder not to make it slow. At the end of the
day, someone who needs this type of specialisation should be able to write his
own per-field codec.
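
The dictionary coding mentioned above could look roughly like the following
Java sketch. It is an illustration only: the class FieldDictionary and its
methods are hypothetical names, not Lucene APIs, and it assumes the set of
distinct values is known up front and has at most 256 entries.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical static dictionary: maps each known field value to one byte. */
final class FieldDictionary {
    private final Map<String, Byte> codes = new HashMap<>();
    private final List<String> values = new ArrayList<>();

    FieldDictionary(List<String> distinctValues) {
        if (distinctValues.size() > 256) {
            throw new IllegalArgumentException("at most 256 values fit into one byte");
        }
        for (String value : distinctValues) {
            codes.put(value, (byte) values.size()); // code = position in the list
            values.add(value);
        }
    }

    /** Returns the one-byte code to store instead of the string. */
    byte encode(String value) {
        Byte code = codes.get(value);
        if (code == null) {
            throw new IllegalArgumentException("unknown value: " + value);
        }
        return code;
    }

    /** Restores the original string from its stored code. */
    String decode(byte code) {
        return values.get(code & 0xFF);             // treat the byte as unsigned
    }

    public static void main(String[] args) {
        List<String> colors = new ArrayList<>();
        colors.add("red");
        colors.add("green");
        colors.add("blue");
        FieldDictionary dict = new FieldDictionary(colors);
        byte stored = dict.encode("green");
        System.out.println(stored + " " + dict.decode(stored));
    }
}
```

The application would store the byte in the field and decode it after
retrieval, so no codec changes are involved.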

Great work, and thanks again!

 

 More options for stored fields compression
 --

 Key: LUCENE-5914
 URL: https://issues.apache.org/jira/browse/LUCENE-5914
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
 Fix For: 4.11

 Attachments: LUCENE-5914.patch


 Since we added codec-level compression in Lucene 4.1 I think I got about the 
 same amount of users complaining that compression was too aggressive and that 
 compression was too light.
 I think it is due to the fact that we have users that are doing very 
 different things with Lucene. For example if you have a small index that fits 
 in the filesystem cache (or is close to), then you might never pay for actual 
 disk seeks and in such a case the fact that the current stored fields format 
 needs to over-decompress data can noticeably slow search down on cheap queries.
 On the other hand, it is more and more common to use Lucene for things like 
 log analytics, and in that case you have huge amounts of data for which you 
 don't care much about stored fields performance. However it is very 
 frustrating to notice that the data that you store takes several times less 
 space when you gzip it compared to your index although Lucene claims to 
 compress stored fields.
 For that reason, I think it would be nice to have some kind of options that 
 would allow to trade speed for compression in the default codec.




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-09-02 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117986#comment-14117986
 ] 

Eks Dev commented on LUCENE-5914:
-

bq. Do you have pointers to emails/irc logs describing such issues?

I do not know what the gold standard lucene usage is, but at least one use case 
I can describe, maybe it helps. I am not proposing anything here, just sharing 
experience. 

Think about the (typical lucene?) usage with structured data (e.g. indexing 
relational db, like product catalog or such) with many smallish fields and then 
retrieving 2k such documents to post-process them, classify, cluster them or 
whatnot (e.g. mahout and co.) 

- Default compression with CHUNK_SIZE makes it decompress 2k * CHUNK_SIZE/2
bytes on average in order to retrieve 2k documents.
- Reducing chunk_size helps a lot, but there is a sweet spot: if you reduce it
too much, you will not see enough compression, your index no longer fits into
the cache, and you get hurt on IO.
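
To put numbers on the first point, here is a small Java sketch of the
arithmetic. The 16 KB and 4 KB chunk sizes are assumptions chosen for
illustration, not measured settings, and the model (each document in a
different chunk, half a chunk decompressed per document on average) is the
one described in the bullet above.

```java
/** Back-of-the-envelope cost model for retrieving documents from chunked storage. */
final class ChunkCost {
    /**
     * Expected number of bytes decompressed to fetch `docs` documents, assuming
     * every document lives in a different chunk and sits, on average, halfway
     * into it.
     */
    static long expectedDecompressedBytes(long docs, long chunkSizeBytes) {
        return docs * chunkSizeBytes / 2;
    }

    public static void main(String[] args) {
        long docs = 2_000;
        // Illustrative chunk sizes only.
        for (long chunk : new long[] {16 * 1024, 4 * 1024}) {
            System.out.println(chunk + " bytes/chunk -> "
                    + expectedDecompressedBytes(docs, chunk) + " bytes decompressed");
        }
    }
}
```

Under these assumptions, 2k documents cost about 16.4 MB of decompression with
16 KB chunks and about 4.1 MB with 4 KB chunks, which is why a smaller
chunk_size helps until the compression ratio starts to suffer.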

Ideally we should be able to use a biggish chunk_size during compression to
improve the compression ratio and still decompress only a single document
(independent of chunk_size), just like you proposed here (if I understood it
correctly?)

Usually, such data is highly compressible (imagine all these low cardinality 
fields like color of something...) and even some basic compression does the 
magic.

What did we do?
- Reduced chunk_size.
- As a bonus to improve compression, added plain static dictionary compression
for a few fields in the update chain (we store analysed fields).
- When applicable, we pre-sort the collection periodically before indexing (on
low-cardinality fields first). This old db-admin secret weapon helps a lot.
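
The pre-sorting step in the last bullet can be sketched in Java as follows.
This is only an illustration: the class PreSort is hypothetical, rows are
modelled as plain string arrays, and the caller is assumed to already know
which columns have the lowest cardinality and to pass them first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper that orders rows so that similar rows end up adjacent. */
final class PreSort {
    /**
     * Returns a sorted copy of the rows, comparing the given columns in order.
     * Pass the lowest-cardinality column first so that long runs of equal
     * values end up next to each other before indexing.
     */
    static List<String[]> sortByColumns(List<String[]> rows, int... columns) {
        if (columns.length == 0) {
            throw new IllegalArgumentException("at least one column is required");
        }
        Comparator<String[]> cmp = Comparator.comparing((String[] r) -> r[columns[0]]);
        for (int i = 1; i < columns.length; i++) {
            final int c = columns[i];
            cmp = cmp.thenComparing((String[] r) -> r[c]);
        }
        List<String[]> sorted = new ArrayList<>(rows);
        sorted.sort(cmp);
        return sorted;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"red", "shirt"},
                new String[] {"blue", "sock"},
                new String[] {"red", "hat"},
                new String[] {"blue", "cap"});
        // Column 0 (color) is assumed to be the low-cardinality field.
        for (String[] row : sortByColumns(rows, 0, 1)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

Documents that share the same low-cardinality values then land in the same
chunks, which gives a block compressor more repetition to work with.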

Conclusion: compression is great, and anything that helps to smoothly tweak
this balance (CPU effort / IO effort) in the different phases
(indexing/retrieving) makes Lucene's use case coverage broader (e.g. I want to
afford more CPU during indexing and less CPU during retrieval, a static coder
being the extreme case of this...).

I am not sure I figured out exactly if and how this patch is going to help in
such cases (how to achieve reasonable compression if we do per-document
compression for small documents? Reusing dictionaries from previous chunks?
Static dictionaries...?).

In any case, thanks for doing the heavy lifting here! I think you already did a
really great job with compression in Lucene.

PS: Ages ago, before Lucene, when memory was really expensive, we had our own
serialization (not in Lucene) that simply had one static Huffman coder per
field (with byte or word symbols), with the code table populated offline. That
was a great, simple option, as it enabled reasonable compression for slowly
changing collections and really fast random access.
 





[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-24 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13718785#comment-13718785
 ] 

Eks Dev commented on SOLR-5069:
---

Wow, this is getting pretty close to collection clustering and other goodies:
plug in Mahout somehow and it's there.

Great job and great direction for Solr. End applications not only need to find
things, they often want to do something with them as well :)

Thanks!   

 MapReduce for SolrCloud
 ---

 Key: SOLR-5069
 URL: https://issues.apache.org/jira/browse/SOLR-5069
 Project: Solr
  Issue Type: New Feature
  Components: SolrCloud
Reporter: Noble Paul
Assignee: Noble Paul

 Solr currently does not have a way to run long running computational tasks 
 across the cluster. We can piggyback on the mapreduce paradigm so that users 
 have smooth learning curve.
  * The mapreduce component will be written as a RequestHandler in Solr
  * Works only in SolrCloud mode. (No support for standalone mode) 
  * Users can write MapReduce programs in Javascript or Java. First cut would 
 be JS ( ? )
 h1. sample word count program
 h2.how to invoke?
 http://host:port/solr/collection-x/mapreduce?map=map-script&reduce=reduce-script&sink=collectionX
 h3. params 
 * map :  A javascript implementation of the map program
 * reduce : a Javascript implementation of the reduce program
 * sink : The collection to which the output is written. If this is not
 passed, the request will be redirected to the reduce node, where it will
 wait until the run is complete and respond with the output of the reduce
 program, emitted as a standard Solr response. If the sink param is passed,
 the response will contain an id of the run, which can be used to query the
 status in another command.
 * reduceNode : Node name where the reduce is run. If not passed, an arbitrary
 node is chosen.
 The node which received the command would first identify one replica from
 each slice where the map program is executed. It will also identify one
 other node from the same collection where the reduce program is run. Each
 run is given an id and the details of the nodes participating in the run will
 be written to ZK (as an ephemeral node).
 h4. map script 
 {code:JavaScript}
 var res = $.streamQuery("*:*"); // this is not run across the cluster, only on this index
 while (res.hasMore()) {
   var doc = res.next();
   var txt = doc.get("txt"); // the field on which the word count is performed
   var words = txt.split(" ");
   for (var i = 0; i < words.length; i++) {
     $.map(words[i], {'count': 1}); // this will send the map over to the reduce host
   }
 }
 {code}
 Essentially two threads are created in the 'map' hosts . One for running the 
 program and the other for co-ordinating with the 'reduce' host . The maps 
 emitted are streamed live over an http connection to the reduce program
 h4. reduce script
 This script is run in one node . This node accepts http connections from map 
 nodes and the 'maps' that are sent are collected in a queue which will be 
 polled and fed into the reduce program. This also keeps the 'reduced' data in 
 memory till the whole run is complete. It expects a done message from all 
 'map' nodes before it declares the tasks are complete. After the reduce program
 is executed for all the input, it proceeds to write out the result to the
 'sink' collection, or the result is written straight out to the response.
 {code:JavaScript}
 var pair = $.nextMap();
 var reduced = $.getCtx().getReducedMap();// a hashmap
 var count = reduced.get(pair.key());
 if(count === null) {
   count = {"count": 0};
   reduced.put(pair.key(), count);
 }
 count.count += pair.val().count ;
 {code}
 h4.example output
 {code:JavaScript}
 {
   "result": {
     "wordx": {
       "count": 15876765
     },
     "wordy": {
       "count": 24657654
     }
   }
 }
 {code}
 TBD
 * The format in which the output is written to the target collection, I 
 assume the reducedMap will have values mapping to the schema of the collection
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (LUCENE-4872) BooleanWeight should decide how to execute minNrShouldMatch

2013-03-27 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615006#comment-13615006
 ] 

Eks Dev commented on LUCENE-4872:
-

The same pattern as Simon's here, just with these terms wrapped in
fuzzy/prefix queries, often as a dismax query.

For example:
BQ(boo* OR hoo* OR whatever) with e.g. minShouldMatch = 2

So the only difference from Simon's case is that single boolean clauses are
often more complicated than a simple TermQuery.


 BooleanWeight should decide how to execute minNrShouldMatch
 ---

 Key: LUCENE-4872
 URL: https://issues.apache.org/jira/browse/LUCENE-4872
 Project: Lucene - Core
  Issue Type: Sub-task
  Components: core/search
Reporter: Robert Muir
 Fix For: 5.0, 4.3

 Attachments: crazyMinShouldMatch.tasks


 LUCENE-4571 adds a dedicated document-at-time scorer for minNrShouldMatch 
 which can use advance() behind the scenes. 
 In cases where you have some really common terms and some rare ones this can 
 be a huge performance improvement.
 On the other hand BooleanScorer might still be faster in some cases.
 We should think about what the logic should be here: one simple thing to do 
 is to always use the new scorer when minShouldMatch is set: thats where i'm 
 leaning. 
 But maybe we could have a smarter heuristic too, perhaps based on cost()


[jira] [Commented] (LUCENE-3918) Port index sorter to trunk APIs

2013-02-04 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570663#comment-13570663
 ] 

Eks Dev commented on LUCENE-3918:
-

This is the right way to give some really good meaning to the venerable
optimize call :)

We were, and still are, sorting our data before indexing just to achieve
exactly this: improved locality of reference. Depending on the data (it has to
be somehow sortable, e.g. by hierarchical structure or by URL...), the speedup
gains (and likely the gains from the compression Adrien added) are sometimes
unbelievable...


 Port index sorter to trunk APIs
 ---

 Key: LUCENE-3918
 URL: https://issues.apache.org/jira/browse/LUCENE-3918
 Project: Lucene - Core
  Issue Type: Task
  Components: modules/other
Affects Versions: 4.0-ALPHA
Reporter: Robert Muir
 Fix For: 4.2, 5.0

 Attachments: LUCENE-3918.patch


 LUCENE-2482 added an IndexSorter to 3.x, but we need to port this
 functionality to 4.0 apis.


[jira] [Commented] (SOLR-4117) IO error while trying to get the size of the Directory

2012-11-28 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505530#comment-13505530
 ] 

Eks Dev commented on SOLR-4117:
---

FWIW, we *think* we observed the following problem in a simple master/slave
setup with NRTCachingDirectory... I am not sure it has anything to do with this
issue, because we did not see this exception. Anyhow:

On replication, the slave gets the index from the master and works fine, then on:
1. graceful restart, the world looks fine
2. kill -9 or such, Solr does not start because the index is corrupt (this
should actually not happen)

We speculate that Solr now replicates directly into the Directory
implementation and does not ensure that replicated files get fsck-ed completely
after replication. As far as I remember, replication used to go to /temp (on
disk) and then move the files if all went OK, working under the assumption that
everything is already persisted. Maybe this invariant does not hold any more
and some explicit fsck is needed for caching directories?

I might be completely wrong; we just observed symptoms in a not really
debug-friendly environment.



 

 IO error while trying to get the size of the Directory
 --

 Key: SOLR-4117
 URL: https://issues.apache.org/jira/browse/SOLR-4117
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 5.0
 Environment: 5.0.0.2012.11.28.10.42.06
 Debian Squeeze, Tomcat 6, Sun Java 6, 10 nodes, 10 shards, rep. factor 2.
Reporter: Markus Jelsma
Assignee: Mark Miller
Priority: Minor
 Fix For: 5.0


 With SOLR-4032 fixed we see other issues when randomly taking down nodes 
 (nicely via tomcat restart) while indexing a few million web pages from 
 Hadoop. We do make sure that at least one node is up for a shard but due to 
 recovery issues it may not be live.
 One node seems to work but generates IO errors in the log and 
 ZookeeperExeption in the GUI. In the GUI we only see:
 {code}
 SolrCore Initialization Failures
 openindex_f: 
 org.apache.solr.common.cloud.ZooKeeperException:org.apache.solr.common.cloud.ZooKeeperException:
  
 Please check your logs for more information
 {code}
 and in the log we only see the following exception:
 {code}
 2012-11-28 11:47:26,652 ERROR [solr.handler.ReplicationHandler] - 
 [http-8080-exec-28] - : IO error while trying to get the size of the 
 Directory:org.apache.lucene.store.NoSuchDirectoryException: directory 
 '/opt/solr/cores/shard_f/data/index' does not exist
 at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:217)
 at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:240)
 at 
 org.apache.lucene.store.NRTCachingDirectory.listAll(NRTCachingDirectory.java:132)
 at 
 org.apache.solr.core.DirectoryFactory.sizeOfDirectory(DirectoryFactory.java:146)
 at 
 org.apache.solr.handler.ReplicationHandler.getIndexSize(ReplicationHandler.java:472)
 at 
 org.apache.solr.handler.ReplicationHandler.getReplicationDetails(ReplicationHandler.java:568)
 at 
 org.apache.solr.handler.ReplicationHandler.handleRequestBody(ReplicationHandler.java:213)
 at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
 at 
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1830)
 at 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:476)
 at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
 at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
 at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
 at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
 at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
 at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
 at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
 at 
 org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
 at 
 org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
 at 
 org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274)
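 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 {code}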

[jira] [Comment Edited] (SOLR-4117) IO error while trying to get the size of the Directory

2012-11-28 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505530#comment-13505530
 ] 

Eks Dev edited comment on SOLR-4117 at 11/28/12 3:27 PM:
-

FWIW, we *think* we observed the following problem in a simple master/slave
setup with NRTCachingDirectory... I am not sure it has anything to do with this
issue, because we did not see this exception. Anyhow:

On replication, the slave gets the index from the master and works fine, then on:
1. graceful restart, the world looks fine
2. kill -9 or such, Solr does not start because the index is corrupt (this
should actually not happen)

We speculate that Solr now replicates directly into the Directory
implementation and does not ensure that replicated files get fsck-ed completely
after replication. As far as I remember, replication used to go to /temp (on
disk) and then move the files if all went OK, working under the assumption that
everything is already persisted. Maybe this invariant does not hold any more
and some explicit fsck is needed for caching directories?

I might be completely wrong; we just observed symptoms in a not really
debug-friendly environment.

Here is the exception after a hard restart:

Caused by: org.apache.solr.common.SolrException: Error opening new searcher
   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:804)
   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:618)
   at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:973)
   at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1003)
   ... 10 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
   at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1441)
   at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1553)
   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:779)
   ... 13 more
Caused by: java.io.FileNotFoundException: ...\core0\data\index\segments_1 (The system cannot find the file specified)
   at java.io.RandomAccessFile.open(Native Method)
   at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
   at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:222)
   at org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:232)
   at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:281)
   at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:56)
   at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:668)
   at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
   at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:87)
   at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:34)
   at org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:120)
   at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1417)

 

  

[jira] [Commented] (SOLR-4117) IO error while trying to get the size of the Directory

2012-11-28 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1350#comment-1350
 ] 

Eks Dev commented on SOLR-4117:
---

fsync of course, fsck was intended for my terminal window :) 
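
For reference, the "write to a temporary file, fsync, then move" pattern
discussed in this thread can be sketched with plain java.nio as below. This
is not Solr's replication code: the class DurableCopy, the method
writeDurably and the ".tmp" suffix are hypothetical, and the sketch assumes
the file system supports atomic moves.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: a file becomes visible under its final name only after fsync. */
final class DurableCopy {
    static void writeDurably(Path target, byte[] data) {
        // Temporary sibling file, so the final move stays on the same file system.
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try {
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
                ByteBuffer buf = ByteBuffer.wrap(data);
                while (buf.hasRemaining()) {
                    ch.write(buf);
                }
                ch.force(true); // the fsync: flush file content and metadata to the device
            }
            // Publish the file under its final name in one step.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("replica");
        Path target = dir.resolve("segments_1");
        writeDurably(target, new byte[] {1, 2, 3});
        System.out.println(Files.size(target));
    }
}
```

With this ordering, a kill -9 during replication leaves at most a stale
".tmp" file behind rather than a half-written index file. On some file
systems the containing directory may also need to be synced for the rename
itself to survive a crash; the sketch leaves that out.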



[jira] [Commented] (SOLR-4032) Unable to replicate between nodes ( read past EOF)

2012-11-27 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504859#comment-13504859
 ] 

Eks Dev commented on SOLR-4032:
---

We see it as well.

It looks like it only happens with NRTCachingDirectory, but take this statement
with a healthy dose of suspicion: we ran without NRTCachingDirectory only once,
and that run went OK.




 Unable to replicate between nodes ( read past EOF)
 --

 Key: SOLR-4032
 URL: https://issues.apache.org/jira/browse/SOLR-4032
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 4.0
 Environment: 5.0-SNAPSHOT 1366361:1404534M - markus - 2012-11-01 
 12:37:38
 Debian Squeeze, Tomcat 6, Sun Java 6, 10 nodes, 10 shards, rep. factor 2.
Reporter: Markus Jelsma
Assignee: Mark Miller
 Fix For: 4.1, 5.0


 Please see: 
 http://lucene.472066.n3.nabble.com/trunk-is-unable-to-replicate-between-nodes-Unable-to-download-completely-td4017049.html
  and 
 http://lucene.472066.n3.nabble.com/Possible-memory-leak-in-recovery-td4017833.html


[jira] [Commented] (LUCENE-4548) BooleanFilter should optionally pass down further restricted acceptDocs in the MUST case (and acceptDocs in general)

2012-11-11 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13494858#comment-13494858
]

Eks Dev commented on LUCENE-4548:
-

...would be to nuke Filters completely from Lucene ...

User +1

Filter is conceptually nothing more than no-scoring and a possibility to have
an implementation that can be cached.

From the user API point of view, there is really no need to bother users with
the Filter abstraction. Both of these are just attributes of the query (do you
need to score this clause, or would you like to have it cached?).

BooleanFilter should optionally pass down further restricted acceptDocs in
the MUST case (and acceptDocs in general)

Key: LUCENE-4548
URL: https://issues.apache.org/jira/browse/LUCENE-4548
Project: Lucene - Core
Issue Type: Bug
Reporter: Uwe Schindler
Attachments: LUCENE-4548.patch

Spin-off from dev@lao:
{quote}
bq. I am about to write a Filter that only operates on a set of documents
that have already passed other filter(s). It's rather expensive, since it
has to use DocValues to examine a value and then determine if it's a match.
So it scales O(n) where n is the number of documents it must see. The 2nd
arg of getDocIdSet is Bits acceptDocs. Unfortunately Bits doesn't have an
int iterator but I can deal with that seeing if it extends DocIdSet.
bq. I'm looking at BooleanFilter which I want to use and I notice that it
passes null to filter.getDocIdSet for acceptDocs, and it justifies this with
the following comment:
bq. // we dont pass acceptDocs, we will filter at the end using an additional
filter
The idea of passing the already built bits for the MUST is a good idea and
can be implemented easily.
The reason why the acceptDocs were not passed down is the new way filters
work in Lucene 4.0 and to optimize caching. Because accept docs are the only
thing that changes when deletions are applied and filters are required to
handle them separately: whenever something is able to cache (e.g.
CachingWrapperFilter), the acceptDocs are not cached, so the underlying
filters get a null acceptDocs to produce the full bitset and the filtering is
done when CachingWrapperFilter gets the “uptodate” acceptDocs. But for this
case this does not matter if the first filter clause does not get acceptdocs,
but later MUST clauses of course can get them (they are not
deletion-specific)!
Can you open issue to optimize the MUST case (possibly MUST_NOT, too)?
Another thing that could help here: You can stop using BooleanFilter if you
can apply the filters sequentially (only MUST clauses) by wrapping with
multiple FilteredQuery: new FilteredQuery(new FilteredQuery(originalQuery,
clause1), clause2). If the DocIdSets enable bits() and the FilteredQuery
autodetection decides to use random access filters, the acceptdocs are also
passed down from the outside to the inner, removing the documents filtered
out.
{quote}
Maybe BooleanFilter should have 2 modes (Boolean ctor argument): Passing down
the acceptDocs to every filter (for the case where Filter calculation is
expensive and accept docs help to limit the calculations) or not passing down
(if the filter is cheap and the multiple acceptDocs bit checks for every
single filter is more expensive – which is then more effective, e.g. when the
Filter is only a cached bitset). The first mode would also optimize the
MUST/MUST_NOT case to pass down the further restricted acceptDocs on later
filters (just like FilteredQuery does).
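
The "pass down" mode for MUST clauses described above can be sketched roughly as follows. This is a minimal illustration and not Lucene code: `SketchFilter`, `MustChainSketch` and the `passDown` flag are hypothetical names, and `java.util.BitSet` stands in for Lucene's `Bits`/`DocIdSet`. The first clause always receives `null`, so its full result stays cacheable; later clauses receive the docs accepted so far.

```java
import java.util.BitSet;
import java.util.List;

/** Hypothetical stand-in for a filter; NOT Lucene's Filter API. */
interface SketchFilter {
  /** Returns matching docs; only docs set in acceptDocs need examining (null = all docs). */
  BitSet getDocIdSet(int maxDoc, BitSet acceptDocs);
}

final class MustChainSketch {
  /**
   * ANDs all MUST clauses. When passDown is true, each later clause receives
   * the docs accepted so far, so an expensive filter can skip everything else.
   */
  static BitSet intersect(List<SketchFilter> mustClauses, int maxDoc, boolean passDown) {
    BitSet result = new BitSet(maxDoc);
    result.set(0, maxDoc);                      // start with every doc accepted
    boolean first = true;
    for (SketchFilter clause : mustClauses) {
      // The first clause always gets null so that its full result stays cacheable.
      BitSet accept = (passDown && !first) ? (BitSet) result.clone() : null;
      result.and(clause.getDocIdSet(maxDoc, accept));
      first = false;
    }
    return result;
  }
}
```

Both modes return the same documents; the difference is only how many documents an expensive later clause has to look at, which is the trade-off the two proposed constructor modes are about.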


[jira] [Commented] (LUCENE-4226) Efficient compression of small to medium stored fields

2012-08-29 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443897#comment-13443897
 ] 

Eks Dev commented on LUCENE-4226:
-

bq. but I removed the ability to select the compression algorithm on a 
per-field basis in order to make the patch simpler and to handle cross-field 
compression.

Maybe it is worth keeping it for really short fields. General-purpose 
compression algorithms are great for larger amounts of data, but for really 
short fields there is nothing like per-field compression.   
Think of database-style usage, e.g. fields with low cardinality, or fields with 
a restricted symbol set (only digits in a long UID field, for example).  Say zip 
code, product color...  these compress perfectly with a static dictionary 
approach (a static Huffman coder with escape symbols, at bit level, or a plain 
vanilla dictionary lookup), and both of them are insanely fast and compress 
heavily. 

Even a trivial utility for users is easily doable: index the data without 
compression, get the frequencies from the term dictionary, estimate e.g. a 
static Huffman code table, and reindex with this dictionary. 
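
A minimal sketch of the "plain vanilla dictionary lookup" variant, assuming the value frequencies have already been collected (for example from the term dictionary of an uncompressed index). `StaticDictionarySketch` and its methods are illustrative names, not an existing Lucene API, and this is the simple lookup variant, not the Huffman coder. Frequent values get the smallest codes, and code 0 is reserved as the escape for values that are missing from the dictionary and must be stored verbatim.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain dictionary lookup for a low-cardinality field (illustrative names only). */
final class StaticDictionarySketch {
  static final int ESCAPE = 0;                  // reserved: value is not in the dictionary
  private final List<String> byCode = new ArrayList<>();
  private final Map<String, Integer> codes = new HashMap<>();

  /** Builds the dictionary from value frequencies; frequent values get small codes. */
  StaticDictionarySketch(Map<String, Integer> frequencies) {
    List<String> values = new ArrayList<>(frequencies.keySet());
    // Descending frequency; ties broken alphabetically to keep the codes deterministic.
    values.sort((a, b) -> {
      int byFreq = Integer.compare(frequencies.get(b), frequencies.get(a));
      return byFreq != 0 ? byFreq : a.compareTo(b);
    });
    byCode.add(null);                           // slot 0 belongs to the escape code
    for (String value : values) {
      codes.put(value, byCode.size());
      byCode.add(value);
    }
  }

  /** Returns the code for a value, or ESCAPE if it has to be stored verbatim. */
  int encode(String value) {
    return codes.getOrDefault(value, ESCAPE);
  }

  /** Returns the value for a code, or null for ESCAPE. */
  String decode(int code) {
    return byCode.get(code);
  }
}
```

With a few hundred distinct colors or zip-code prefixes, every stored value shrinks to one or two bytes regardless of how short the original string was, which is the case where block compressors have nothing to work with.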


 Efficient compression of small to medium stored fields
 --

 Key: LUCENE-4226
 URL: https://issues.apache.org/jira/browse/LUCENE-4226
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: Adrien Grand
Priority: Trivial
 Attachments: CompressionBenchmark.java, CompressionBenchmark.java, 
 LUCENE-4226.patch, LUCENE-4226.patch, SnappyCompressionAlgorithm.java


 I've been doing some experiments with stored fields lately. It is very common 
 for an index with stored fields enabled to have most of its space used by the 
 .fdt index file. To prevent this .fdt file from growing too much, one option 
 is to compress stored fields. Although compression works rather well for 
 large fields, this is not the case for small fields and the compression ratio 
 can be very close to 100%, even with efficient compression algorithms.
 In order to improve the compression ratio for small fields, I've written a 
 {{StoredFieldsFormat}} that compresses several documents in a single chunk of 
 data. To see how it behaves in terms of document deserialization speed and 
 compression ratio, I've run several tests with different index compression 
 strategies on 100,000 docs from Mike's 1K Wikipedia articles (title and text 
 were indexed and stored):
  - no compression,
  - docs compressed with deflate (compression level = 1),
  - docs compressed with deflate (compression level = 9),
  - docs compressed with Snappy,
  - using the compressing {{StoredFieldsFormat}} with deflate (level = 1) and 
 chunks of 6 docs,
  - using the compressing {{StoredFieldsFormat}} with deflate (level = 9) and 
 chunks of 6 docs,
  - using the compressing {{StoredFieldsFormat}} with Snappy and chunks of 6 
 docs.
 For those who don't know Snappy, it is a compression algorithm from Google 
 which does not reach the compression ratio of deflate (see the table below), 
 but compresses and decompresses data very quickly.
 {noformat}
 Format            Compression ratio   IndexReader.document time
 ----------------------------------------------------------------
 uncompressed      100%                100%
 doc/deflate 1      59%                616%
 doc/deflate 9      58%                595%
 doc/snappy         80%                129%
 index/deflate 1    49%                966%
 index/deflate 9    46%                938%
 index/snappy       65%                264%
 {noformat}
 (doc = doc-level compression, index = index-level compression)
 I find it interesting because it allows to trade speed for space (with 
 deflate, the .fdt file shrinks by a factor of 2, much better than with 
 doc-level compression). One other interesting thing is that {{index/snappy}} 
 is almost as compact as {{doc/deflate}} while it is more than 2x faster at 
 retrieving documents from disk.
 These tests have been done on a hot OS cache, which is the worst case for 
 compressed fields (one can expect better results for formats that have a high 
 compression ratio since they probably require fewer read/write operations 
 from disk).
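
The effect of compressing several small documents as one chunk can be reproduced with the JDK alone. The sketch below is not the `StoredFieldsFormat` from the patch: `ChunkCompressionSketch` is a hypothetical helper that only measures sizes with `java.util.zip.Deflater`, and it ignores the per-document offsets a real format would need in order to find a document inside a chunk.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

final class ChunkCompressionSketch {
  /** Returns the deflated size in bytes of the given input. */
  static int deflatedSize(byte[] input, int level) {
    Deflater deflater = new Deflater(level);
    deflater.setInput(input);
    deflater.finish();
    byte[] buffer = new byte[4096];
    int total = 0;
    while (!deflater.finished()) {
      total += deflater.deflate(buffer);        // count output bytes, discard the data
    }
    deflater.end();
    return total;
  }

  /** Sum of the sizes when every document is compressed on its own. */
  static int perDocSize(String[] docs, int level) {
    int total = 0;
    for (String doc : docs) {
      total += deflatedSize(doc.getBytes(StandardCharsets.UTF_8), level);
    }
    return total;
  }

  /** Size when all documents are concatenated and compressed as one chunk. */
  static int chunkSize(String[] docs, int level) {
    StringBuilder chunk = new StringBuilder();
    for (String doc : docs) {
      chunk.append(doc);
    }
    return deflatedSize(chunk.toString().getBytes(StandardCharsets.UTF_8), level);
  }
}
```

For small, similar documents `chunkSize` comes out below `perDocSize`, because the compressor can reuse redundancy across documents and pays the stream overhead only once. The price is that reading one document means decompressing its whole chunk, which matches the slower `IndexReader.document` times reported for the index-level rows.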


[jira] [Commented] (SOLR-3684) Frequently full gc while do pressure index

2012-08-07 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13429985#comment-13429985
 ] 

Eks Dev commented on SOLR-3684:
---

We did it a long time ago on Tomcat, as we use particularly expensive 
analyzers, so even for searching the optimum is around the number of cores. 
Actually, that was the only big problem with Solr we had.  
 
Actually, anything that keeps insane thread churn low helps. Not only the max 
number of threads, but also the TTL for idle threads should somehow be 
increased. The longer threads live, the better. Solr is completely safe due to 
core reloading and smart index management, so there is no point in renewing 
threads.   

If one needs to queue requests, that is just another problem, but for this 
there is no need to raise max worker threads to more than the number of cores 
plus some smallish constant.

What we would like to achieve is to keep separate thread pools for searching, 
indexing and the rest... but we never managed to figure out how to do it. 
Even benign requests (/ping, /status, whatever) increase thread churn... If we 
were able to configure separate pools, we could keep a small number of 
long-living threads for searching, an even smaller number for indexing and one 
"who cares" pool for the rest. It is somehow possible on Tomcat; if someone 
knows how to do it, please share. 
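
The sizing idea (few threads, long-lived, separate pools per kind of work) can be illustrated with plain `java.util.concurrent`. This is only a sketch of the principle, not Tomcat or Jetty connector configuration, which the comment leaves open; `SeparatePoolsSketch` and the pool sizes are assumptions made for illustration.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class SeparatePoolsSketch {
  /** A fixed-size pool whose threads are never retired while idle. */
  static ThreadPoolExecutor longLivedPool(int threads) {
    // core == max and core threads do not time out by default, so the same threads
    // (and their thread-local analyzer state) are reused; extra requests wait in the queue.
    return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<Runnable>());
  }

  static final int CORES = Runtime.getRuntime().availableProcessors();

  // "number of cores plus some smallish constant"; the constant 2 is an arbitrary choice.
  static final ThreadPoolExecutor SEARCH = longLivedPool(CORES + 2);
  // "even smaller number for indexing"; half the cores is an arbitrary choice.
  static final ThreadPoolExecutor INDEXING = longLivedPool(Math.max(1, CORES / 2));
  // One "who cares" pool for /ping, /status and the like.
  static final ThreadPoolExecutor OTHER = longLivedPool(1);
}
```

Because threads are never recycled, per-thread state such as reusable analyzer components is allocated once per pool thread instead of once per short-lived worker.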

 Frequently full gc while do pressure index
 --

 Key: SOLR-3684
 URL: https://issues.apache.org/jira/browse/SOLR-3684
 Project: Solr
  Issue Type: Improvement
  Components: multicore
Affects Versions: 4.0-ALPHA
 Environment: System: Linux
 Java process: 4G memory
 Jetty: 1000 threads 
 Index: 20 field
 Core: 5
Reporter: Raintung Li
Priority: Critical
  Labels: garbage, performance
 Fix For: 4.0

 Attachments: patch.txt

   Original Estimate: 168h
  Remaining Estimate: 168h

 Recently we tested Solr indexing throughput and performance. The test was 
 configured with 20 fields of the normal text_general field type, 1000 threads 
 for Jetty, and 5 cores.
 After the test had run for some time, the throughput of the Solr process 
 dropped very quickly. Looking for the root cause, we found that the Java 
 process was constantly doing full GCs. 
 In the heap dump, the dominant object is StandardTokenizer, which is stored in 
 the CloseableThreadLocal by IndexSchema.SolrIndexAnalyzer.
 Solr uses PerFieldReuseStrategy as the default component reuse strategy, 
 which means that each field has its own StandardTokenizer if it uses the 
 standard analyzer, and each StandardTokenizer occupies 32KB of memory because 
 of its zzBuffer char array.
 The worst case: Total memory = live threads*cores*fields*32KB
 In the test case, that is 1000*5*20*32KB = 3.2G for StandardTokenizer, and 
 those objects can only be released when their thread dies.
 Suggestion:
 Every request is handled by only one thread, which means that one document is 
 analyzed by only one thread. A thread parses the document's fields one after 
 another, so fields of the same type can share the same reusable component. 
 When the thread moves on to another field of the same type, it only needs to 
 reset the input stream of that component, which saves a lot of memory for 
 fields of the same type.
 Total memory will be = live threads*cores*(different fields types)*32KB
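
The two formulas can be checked with a few lines of arithmetic. The sketch assumes decimal units (32KB = 32,000 bytes), which is what makes the reported 3.2G come out exactly, and it assumes that all 20 test fields share the single text_general type; `TokenizerMemorySketch` is an illustrative name.

```java
final class TokenizerMemorySketch {
  /** 32KB per StandardTokenizer buffer, in decimal KB as the report's 3.2G figure implies. */
  static final long BUFFER_BYTES = 32L * 1000;

  /** Worst-case bytes held by per-thread tokenizers for the given number of reusable components. */
  static long worstCaseBytes(int liveThreads, int cores, int reusableComponents) {
    return (long) liveThreads * cores * reusableComponents * BUFFER_BYTES;
  }
}
```

With one component per field, `worstCaseBytes(1000, 5, 20)` gives 3,200,000,000 bytes (3.2G). With one component per field type, and a single type in the test, `worstCaseBytes(1000, 5, 1)` gives 160,000,000 bytes, a 20x reduction.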
 The source code change is simple; I can provide a patch for 
 IndexSchema.java: 
 private class SolrIndexAnalyzer extends AnalyzerWrapper {
 
   private class SolrFieldReuseStrategy extends ReuseStrategy {
     /**
      * {@inheritDoc}
      */
     @SuppressWarnings("unchecked")
     public TokenStreamComponents getReusableComponents(String fieldName) {
       Map<Analyzer, TokenStreamComponents> componentsPerField =
           (Map<Analyzer, TokenStreamComponents>) getStoredValue();
       return componentsPerField != null ?
           componentsPerField.get(analyzers.get(fieldName)) : null;
     }
     /**
      * {@inheritDoc}
      */
     @SuppressWarnings("unchecked")
     public void setReusableComponents(String fieldName,
         TokenStreamComponents components) {
       Map<Analyzer, TokenStreamComponents> componentsPerField =
           (Map<Analyzer, TokenStreamComponents>) getStoredValue();
       if (componentsPerField == null) {
         componentsPerField = new HashMap<Analyzer, TokenStreamComponents>();
         setStoredValue(componentsPerField);
       }
       componentsPerField.put(analyzers.get(fieldName), components);
     }
   }
 
   protected final static HashMap<String, Analyzer> analyzers;
 /**
  * Implementation of {@link ReuseStrategy} that reuses components 
 per-field by
  * maintaining a Map of

[jira] [Commented] (LUCENE-3312) Break out StorableField from IndexableField

2012-06-01 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13287213#comment-13287213
 ] 

Eks Dev commented on LUCENE-3312:
-

bq. My assumption is that StoredField-s will never be used anymore as potential 
sources of token streams?

One case where it might make sense is a scenario where a user wants to store 
the analyzed field (not the original) and later read it back as a TokenStream. 
Kind of a TermVector without tf. I think I remember seeing a great patch with 
an indexable-storable field (with serialization and deserialization).

A user can do it in two passes, but sometimes it is not cheap to analyze 
twice.



 Break out StorableField from IndexableField
 ---

 Key: LUCENE-3312
 URL: https://issues.apache.org/jira/browse/LUCENE-3312
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Reporter: Michael McCandless
Assignee: Nikola Tankovic
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: Field Type branch

 Attachments: lucene-3312-patch-01.patch, lucene-3312-patch-02.patch, 
 lucene-3312-patch-03.patch, lucene-3312-patch-04.patch


 In the field type branch we have strongly decoupled
 Document/Field/FieldType impl from the indexer, by having only a
 narrow API (IndexableField) passed to IndexWriter.  This frees apps up
 to use their own documents instead of the user-space impls we provide
 in oal.document.
 Similarly, with LUCENE-3309, we've done the same thing on the
 doc/field retrieval side (from IndexReader), with the
 StoredFieldsVisitor.
 But, maybe we should break out StorableField from IndexableField,
 such that when you index a doc you provide two Iterables -- one for the
 IndexableFields and one for the StorableFields.  Either can be null.
 One downside is a possible perf hit for fields that are both indexed and
 stored (ie, we visit them twice, lookup their name in a hash twice,
 etc.).  But the upside is a cleaner separation of concerns in API


[jira] [Updated] (SOLR-2701) Expose IndexWriter.commit(Map<String,String> commitUserData) to solr

2011-08-08 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated SOLR-2701:
--

Attachment: SOLR-2701.patch

A rather simplistic approach: adding userCommitData to CommitUpdateCommand.

So we at least have a vehicle to pass it to IndexWriter.

No advanced machinery to make it available to non-expert users. At least it is 
not wrong to have it there?

Eclipse removed some unused imports from DUH2 as well.

 Expose IndexWriter.commit(Map<String,String> commitUserData) to solr 
 -

 Key: SOLR-2701
 URL: https://issues.apache.org/jira/browse/SOLR-2701
 Project: Solr
  Issue Type: New Feature
  Components: update
Affects Versions: 4.0
Reporter: Eks Dev
Priority: Minor
  Labels: commit, update
 Attachments: SOLR-2701.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 At the moment, there is no feature that enables associating user information 
 to the commit point.
  
 Lucene supports this possibility and it should be exposed to solr as well, 
 probably via beforeCommit Listener (analogous to prepareCommit in Lucene).
 Most likely home for this Map to live is UpdateHandler.
 Example use case would be an atomic tracking of sequence numbers or 
 timestamps for incremental updates.


[jira] [Commented] (SOLR-2701) Expose IndexWriter.commit(Map<String,String> commitUserData) to solr

2011-08-06 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13080474#comment-13080474
]

Eks Dev commented on SOLR-2701:
---

One hook for users to update the content of this map would be to add beforeCommit
callbacks. This looks simple enough in the UpdateHandler2.commit() call, but there
is a catch:

We need to invoke listeners before we close() for implicit commits... With a
decref-ed IndexWriter, the question is whether we want to run beforeCommit
listeners even if the IW does not really get closed (the user updates the map
more often than needed).

IMO, this should not be a problem, invoking callbacks a little bit more often
than needed.

Another place where we have an implicit commit is newIndexWriter();
here we only need to add IndexWriterProvider.isIndexWriterNull() to check if we
need callbacks.

A solution for close() would also be simple: add
IndexWriterProvider.isIndexGoingToCloseOnNextDecref() before invoking decref()
to condition the callbacks.

Any better solution? Are callbacks a good approach to provide user hooks for
this?
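
A rough sketch of what such a hook could look like, written without any Solr or Lucene dependency. `BeforeCommitListener` and `CommitHookSketch` are hypothetical names; neither project defines them, and the step that hands the resulting map to the IndexWriter commit call is deliberately left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical hook; neither Solr nor Lucene defines this interface. */
interface BeforeCommitListener {
  void beforeCommit(Map<String, String> commitUserData);
}

final class CommitHookSketch {
  private final List<BeforeCommitListener> listeners = new ArrayList<>();

  void addListener(BeforeCommitListener listener) {
    listeners.add(listener);
  }

  /** Runs all listeners and returns the map that would be handed to the commit call. */
  Map<String, String> collectCommitUserData() {
    Map<String, String> userData = new HashMap<>();
    for (BeforeCommitListener listener : listeners) {
      // Safe to call more often than strictly needed: listeners only fill the map.
      listener.beforeCommit(userData);
    }
    return userData;
  }
}
```

A listener that records the last processed sequence number would then be registered as `hook.addListener(data -> data.put("seqNo", "42"))`, which covers the incremental-update use case from the issue description.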

---
Another approach is to get beforeCommit callbacks at the Lucene level and
piggy-back on them for the Solr callbacks?
We would only need to change IndexWriter.commit(Map..) and close(), but commit
is final...

Notice: I am very rusty regarding the solr/lucene codebase, so any help would be
appreciated. The last patch I made here was ages ago :)



[jira] [Commented] (LUCENE-1879) Parallel incremental indexing

2011-08-01 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13073462#comment-13073462
 ] 

Eks Dev commented on LUCENE-1879:
-

The user mentioned in the comment above was me, I guess. Commenting here just 
to add an interesting use case that would be perfectly solved by this issue.  

Imagine a Solr master-slave setup where the full document contains CONTENT and 
ID fields, e.g. a 200Mio+ collection. On the master, we need the ID field 
indexed in order to process delete/update commands. On the slave, we do not 
need lookup on ID and would like to keep our TermsDictionary small, without 
exploding it with 200Mio+ unique ID terms (ouch, this is a lot compared to 
5Mio unique terms in CONTENT, with or without pulsing). 

With this issue, this could be natively achieved by modifying the Solr 
UpdateHandler not to transfer the ID index to slaves at all.

There are other ways to fix it, but this would be the best. (I am currently 
investigating an option to transfer the full index on update, but to filter 
out the TermsDictionary at the IndexReader level: it remains on disk, but this 
part never gets accessed on slaves. I do not know yet if this is possible at 
all in general, e.g. when an FST-based term dictionary is already built; a 
prefix-compressed TermDict would be doable.)

 Parallel incremental indexing
 -

 Key: LUCENE-1879
 URL: https://issues.apache.org/jira/browse/LUCENE-1879
 Project: Lucene - Java
  Issue Type: New Feature
  Components: core/index
Reporter: Michael Busch
Assignee: Michael Busch
 Fix For: 4.0

 Attachments: parallel_incremental_indexing.tar


 A new feature that allows building parallel indexes and keeping them in sync 
 on a docID level, independent of the choice of the MergePolicy/MergeScheduler.
 Find details on the wiki page for this feature:
 http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing 
 Discussion on java-dev:
 http://markmail.org/thread/ql3oxzkob7aqf3jd


[jira] [Commented] (LUCENE-3289) FST should allow controlling how hard builder tries to share suffixes

2011-07-08 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13061804#comment-13061804
 ] 

Eks Dev commented on LUCENE-3289:
-

bq. The strings are extremely long (more like short documents) and probably 
need to be compressed in some different datastructure, e.g. a word-based one?

That would indeed be cool, e.g. an FST with words (ngrams?) as symbols. Ages 
ago we used one trie for all unique terms, to get prefix/edit distance on 
words, and one word-trie (symbols were words, via a symbol table) for 
documents. I am sure this would cut memory requirements significantly for 
multiword cases when compared to a char-level FST.
E.g. a TermDictionary that supports ord() could be used as a symbol table.
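
A minimal sketch of such a word-level trie, assuming whitespace tokenization and a plain `HashMap` as the symbol table (the role a term dictionary with ord() could play). `WordTrieSketch` is an illustrative name; it is a prefix trie only and shares no suffixes, so it is not an FST.

```java
import java.util.HashMap;
import java.util.Map;

final class WordTrieSketch {
  /** Symbol table: word -> small int, assigned in order of first appearance. */
  private final Map<String, Integer> ords = new HashMap<>();

  private static final class Node {
    final Map<Integer, Node> children = new HashMap<>();
    boolean terminal;
  }

  private final Node root = new Node();
  private int nodeCount = 1;                    // the root counts as one node

  /** Adds a whitespace-separated phrase; phrases with a common word prefix share nodes. */
  void add(String phrase) {
    Node node = root;
    for (String word : phrase.trim().split("\\s+")) {
      Integer ord = ords.get(word);
      if (ord == null) {
        ord = ords.size();
        ords.put(word, ord);
      }
      Node next = node.children.get(ord);
      if (next == null) {
        next = new Node();
        node.children.put(ord, next);
        nodeCount++;
      }
      node = next;
    }
    node.terminal = true;
  }

  /** True only if exactly this phrase was added before. */
  boolean contains(String phrase) {
    Node node = root;
    for (String word : phrase.trim().split("\\s+")) {
      Integer ord = ords.get(word);
      if (ord == null) return false;
      node = node.children.get(ord);
      if (node == null) return false;
    }
    return node.terminal;
  }

  int nodeCount() {
    return nodeCount;
  }
}
```

Each transition consumes a whole word, so a 67-character title needs roughly ten nodes instead of 67, at the cost of keeping the symbol table.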






 FST should allow controlling how hard builder tries to share suffixes
 -

 Key: LUCENE-3289
 URL: https://issues.apache.org/jira/browse/LUCENE-3289
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.4, 4.0

 Attachments: LUCENE-3289.patch, LUCENE-3289.patch


 Today we have a boolean option to the FST builder telling it whether
 it should share suffixes.
 If you turn this off, building is much faster, uses much less RAM, and
 the resulting FST is a prefix trie.  But, the FST is larger than it
 needs to be.  When it's on, the builder maintains a node hash holding
 every node seen so far in the FST -- this uses up RAM and slows things
 down.
 On a dataset that Elmer (see java-user thread "Autocompletion on large
 index" on Jul 6 2011) provided (thank you!), which is 1.32 M titles
 avg 67.3 chars per title, building with suffix sharing on took 22.5
 seconds, required 1.25 GB heap, and produced 91.6 MB FST.  With suffix
 sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST.
 I think we should allow this boolean to be shade-of-gray instead:
 usually, how well suffixes can share is a function of how far they are
 from the end of the string, so, by adding a tunable N to only share
 when suffix length < N, we can let caller make reasonable tradeoffs. 


[jira] [Commented] (LUCENE-3135) backport suggest module to branch 3.x

2011-05-24 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038418#comment-13038418
 ] 

Eks Dev commented on LUCENE-3135:
-

> if we can backport the FST-based functionality
+1

 backport suggest module to branch 3.x
 -

 Key: LUCENE-3135
 URL: https://issues.apache.org/jira/browse/LUCENE-3135
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Reporter: Robert Muir

 It would be nice to develop a plan to expose the autosuggest functionality to 
 Lucene users in 3.x
 There are some complications, such as seeing if we can backport the FST-based 
 functionality,
 which might require a good bit of work. But I think this would be well-worth 
 it.


[jira] Commented: (LUCENE-2557) FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

2010-07-26 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892341#action_12892341
 ] 

Eks Dev commented on LUCENE-2557:
-

It looks like we have one invariant:
IDF(QueryTerm) >= IDF(ExpansionTerm) // prevents documents matching an ET from 
scoring better than documents with an exact match on the QT.

Fixing all expansions to IDF(QT) would remove the dynamics of the score, making 
the contribution to the score identical for all expansions. Maybe 
proportionally scaling the IDF of all expansions to preserve their mutual IDF 
dynamics (relative to IDF(QT), to keep up with the invariant) would work better?

In the case when there is no matching QueryTerm, why not simply preserve the 
expansion term IDF? What is averaging good for, performance?
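
One possible reading of the "proportional scaling" idea, as a sketch: shrink all expansion-term IDFs by one common factor so that the largest of them equals IDF(QT). The comment does not spell out a formula, so the choice of factor here is an assumption, and `ExpansionIdfSketch` is an illustrative name rather than anything in the attached patches.

```java
final class ExpansionIdfSketch {
  /**
   * Scales expansion-term IDFs by one common factor so that none exceeds the
   * query-term IDF, while their ratios to each other are preserved.
   */
  static double[] scaleToInvariant(double queryTermIdf, double[] expansionIdfs) {
    double max = 0.0;
    for (double idf : expansionIdfs) {
      max = Math.max(max, idf);
    }
    // Only shrink; if the invariant already holds, the IDFs are left untouched.
    double factor = max > queryTermIdf ? queryTermIdf / max : 1.0;
    double[] scaled = new double[expansionIdfs.length];
    for (int i = 0; i < expansionIdfs.length; i++) {
      scaled[i] = expansionIdfs[i] * factor;
    }
    return scaled;
  }
}
```

For IDF(QT) = 4 and expansion IDFs {8, 2}, the factor is 0.5 and the result is {4, 1}: the rare misspelling no longer outscores the exact match, but it still scores four times higher than the other expansion, so the relative dynamics survive.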

 FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches
 --

 Key: LUCENE-2557
 URL: https://issues.apache.org/jira/browse/LUCENE-2557
 Project: Lucene - Java
  Issue Type: Bug
  Components: Query/Scoring
Affects Versions: 3.0.2
Reporter: Jingkei Ly
 Attachments: idf-scoring-test-case.patch, LUCENE-2557.patch


 The FuzzyQuery often causes misspellings to be ranked higher than the exact 
 match, which seems to be an undesirable property generally. 
 For example, in an index of surnames, if I search using a FuzzyQuery for 
 "smith", the misspellings such as "smiith", or "smiht" would appear near the 
 top of the search results ahead of documents that match "smith".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (LUCENE-2482) Index sorter

2010-05-27 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12872386#action_12872386
]

Eks Dev commented on LUCENE-2482:
-

Re: I'm not sure if I follow your use case though

A simple case: you have 100Mio docs with 2 fields, CITY and TEXT.

Sorting on CITY makes postings look like:
Orlando: -
New York: -
perfectly compressible,

without really affecting the distribution (compressibility) of terms from the
TEXT field.

If CITY remained in unsorted order (e.g. uniform distribution), you would deal
with very large postings for all terms coming from this field.

Sorting on many fields often helps, e.g. if you have hierarchical compositions
like 1 CITY with many ZIP_CODES... Philosophically, sorting always increases
compressibility and improves locality of reference... but sure, you need to
know what you want.
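
The compressibility claim can be made concrete by counting bytes for delta-encoded postings. The sketch assumes a generic variable-length int encoding with 7 payload bits per byte and one posting per document and CITY term; `SortedPostingsSketch` is an illustrative helper, not Lucene's postings format.

```java
import java.util.Arrays;

final class SortedPostingsSketch {
  /** Bytes needed for one value in a 7-bits-per-byte variable-length int encoding. */
  static int vIntBytes(int value) {
    int bytes = 1;
    while ((value >>>= 7) != 0) {
      bytes++;
    }
    return bytes;
  }

  /**
   * Delta-encoded size of the postings of all CITY terms. cityOfDoc[docId] is the
   * city ordinal of that document; doc IDs are visited in increasing order.
   */
  static long postingsBytes(int[] cityOfDoc, int numCities) {
    int[] lastDoc = new int[numCities];
    Arrays.fill(lastDoc, -1);
    long total = 0;
    for (int doc = 0; doc < cityOfDoc.length; doc++) {
      int city = cityOfDoc[doc];
      // The first posting stores the doc ID itself, later ones store the gap.
      int delta = lastDoc[city] < 0 ? doc : doc - lastDoc[city];
      total += vIntBytes(delta);
      lastDoc[city] = doc;
    }
    return total;
  }
}
```

With 100,000 documents spread over 200 cities, an index sorted by CITY has gaps of 1 inside every posting list (one byte each), while a uniformly interleaved index has gaps of 200 (two bytes each), so the sorted postings take about half the space even before any run-length trick is applied.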

Index sorter

Key: LUCENE-2482
URL: https://issues.apache.org/jira/browse/LUCENE-2482
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Affects Versions: 3.1
Reporter: Andrzej Bialecki
Fix For: 3.1

Attachments: indexSorter.patch

A tool to sort index according to a float document weight. Documents with
high weight are given low document numbers, which means that they will be
first evaluated. When using a strategy of early termination of queries (see
TimeLimitedCollector) such sorting significantly improves the quality of
partial results.
(Originally this tool was created by Doug Cutting in Nutch, and used norms as
document weights - thus the ordering was limited by the limited resolution of
norms. This is a pure Lucene version of the tool, and it uses arbitrary
floats from a specified stored field).


[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833860#action_12833860
]

Eks Dev commented on LUCENE-329:

{quote}
query for John~ Patitucci~ I'm probably more interested in a partial match on
the rarer surname than a partial match on the common forename.
{quote}

As a matter of fact, we have not just one frequency to consider, but two
term frequencies!

Consider a simpler case.
Query term: Johan // would be a high-frequency term
gives:
Fuzzy expanded term 1: Johana // high frequency
Fuzzy expanded term 2: Joahn // low frequency

I guess you would like to score the second term higher, meaning lower frequency
(higher IDF)... So far so good.

Now turn it upside down and search for the LF typo Joahn... in that case you
would prefer the HF term Johan from the expanded list to score higher...

The point being, the situation here is just not complete without taking both
frequencies into consideration (query term and expanded term). In my
experience, some simple nonlinear hints based on these two freqs bring some
easy precision points (HF-LF pairs are much more likely to be typos than
HF-HF pairs...).

Fuzzy query scoring issues
--

Key: LUCENE-329
URL: https://issues.apache.org/jira/browse/LUCENE-329
Project: Lucene - Java
Issue Type: Bug
Components: Search
Affects Versions: 1.2rc5
Environment: Operating System: All
Platform: All
Reporter: Mark Harwood
Priority: Minor
Attachments: patch.txt

Queries which automatically produce multiple terms (wildcard, range, prefix,
fuzzy etc.) currently suffer from two problems:
1) Scores for matching documents are significantly smaller than term queries
because of the volume of terms introduced (A match on query Foo~ is 0.1
whereas a match on query Foo is 1).
2) The rarer forms of expanded terms are favoured over those of more common
forms because of the IDF. When using Fuzzy queries for example, rare
misspellings typically appear in results before the more common correct
spellings.
I will attach a patch that corrects the issues identified above by
1) Overriding Similarity.coord to counteract the downplaying of scores
introduced by expanding terms.
2) Taking the IDF factor of the most common form of expanded terms as the
basis of scoring all other expanded terms.


[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-12 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832911#action_12832911
 ] 

Eks Dev commented on LUCENE-2089:
-

{quote}
...Aaron i think generation may pose a problem for a full unicode alphabet...
{quote}

I wouldn't discount Aaron's approach so quickly! There is one *really smart*
way to approach generation of the distance neighborhood. Have a look at FastSS
http://fastss.csg.uzh.ch/  The trick is to delete, not to generate variations
over the complete alphabet! They call it the deletion neighborhood. It also
generates far fewer variation terms, reducing pressure on the binary search in
TermDict!

You do not get all the goodies of a weighted distance implementation, but
the solution is much simpler. It would work similarly to the current
spellchecker (just a lookup on variations), only faster. They even have some
example code showing how they generate deletions
(http://fastss.csg.uzh.ch/FastSimilarSearch.java).
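
A minimal sketch of the deletion neighborhood idea follows. It is not the FastSS code linked above; the class and method names are made up for this example, and a shared variant only marks a pair as a candidate that still has to be verified with a real distance computation.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative sketch of a deletion neighborhood; names are hypothetical, not FastSS API.
final class DeletionNeighborhood {

    /** Returns the word itself plus every string obtained by deleting up to maxDeletions characters. */
    static Set<String> of(String word, int maxDeletions) {
        Set<String> out = new LinkedHashSet<>();
        collect(word, maxDeletions, out);
        return out;
    }

    private static void collect(String s, int remaining, Set<String> out) {
        // A variant's length fixes how many deletions are left, so a repeated variant can be skipped.
        if (!out.add(s) || remaining == 0) {
            return;
        }
        for (int i = 0; i < s.length(); i++) {
            collect(s.substring(0, i) + s.substring(i + 1), remaining - 1, out);
        }
    }

    /** Candidate test: true when the two neighborhoods share at least one variant. */
    static boolean shareVariant(String a, String b, int maxDeletions) {
        Set<String> common = new LinkedHashSet<>(of(a, maxDeletions));
        common.retainAll(of(b, maxDeletions));
        return !common.isEmpty();
    }
}
```

Note that nothing here iterates over an alphabet: the number of variants depends only on the word length and the number of deletions, which is the point made above about a full Unicode alphabet.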

{quote}
but the more intelligent stuff you speak of could be really cool esp. for 
spellchecking, sure you dont want to rewrite our spellchecker?

btw its not clear to me yet, could you implement that stuff on top of ghetto 
DFA (the sorted terms dict we have now) or is something more sophisticated 
needed? its a lot easier to write this stuff now with the flex MTQ apis 
{quote}

I really would love to, but I was paid before to work on this.

I guess the ghetto DFA would not work, at least not fast enough (I didn't think
about it really). Practically, you would need to know which characters extend
the current character in your dictionary, or in DFA parlance, all outgoing
transitions from the current state. The ghetto DFA cannot do that efficiently?

What would be an idea with flex is to implement this stuff with an in-memory
trie (full trie or TST), before jumping into the noisy channel (this is easy to
add later) and a persistent trie dictionary. The traversal part is identical,
and it would make a nice contrib with a useful use case, as the majority of
folks have enough memory to slurp the complete TermDict into memory... It would
serve as a proof of concept for flex and fuzzyQ, and help you understand the
magic of calculating edit distance against trie structures. Once you have a
trie structure, the sky is the limit: prefix, regex... If I remember correctly,
there were some trie implementations floating around; with one of them you need
just one extra traversal method to find all terms at distance N. You can have a
look at http://jaspell.sourceforge.net/, the TST implementation, class
TernarySearchTrie, matchAlmost(...) methods. Just as an illustration of what is
going on there: it is a simple recursive traversal of all terms at a max
distance of N.
Later we could tweak memory demand, switch to some more compact trie... and at
the end add weighted distance and convince Mike to make a blazing fast
persistent trie :)... In the meantime, the folks with enough memory would have
really, really fast fuzzy, prefix... better distance...
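
The "one extra traversal method" could look roughly like the sketch below: a plain in-memory trie where each step down the trie computes one more row of the Levenshtein table, so shared prefixes are computed only once. This is an assumption-laden illustration (unit costs, no transpositions, hypothetical class name FuzzyTrie), not the jaspell code and not a Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative in-memory trie with a fuzzy traversal; all names are hypothetical.
final class FuzzyTrie {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        String term; // non-null when a dictionary term ends at this node
    }

    private final Node root = new Node();

    void add(String term) {
        Node node = root;
        for (char c : term.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.term = term;
    }

    /** Returns all terms whose Levenshtein distance to query is at most maxDist. */
    List<String> within(String query, int maxDist) {
        int[] firstRow = new int[query.length() + 1];
        for (int i = 0; i < firstRow.length; i++) {
            firstRow[i] = i; // distance between the empty prefix and query[0..i)
        }
        List<String> hits = new ArrayList<>();
        if (root.term != null && query.length() <= maxDist) {
            hits.add(root.term); // the empty term, if it was added
        }
        for (Map.Entry<Character, Node> e : root.children.entrySet()) {
            walk(e.getValue(), e.getKey(), query, firstRow, maxDist, hits);
        }
        return hits;
    }

    private static void walk(Node node, char c, String query, int[] prev,
                             int maxDist, List<String> hits) {
        // One new row of the edit distance table for the prefix ending in c.
        int[] row = new int[prev.length];
        row[0] = prev[0] + 1;
        int best = row[0];
        for (int i = 1; i < row.length; i++) {
            int replace = prev[i - 1] + (query.charAt(i - 1) == c ? 0 : 1);
            row[i] = Math.min(replace, Math.min(prev[i] + 1, row[i - 1] + 1));
            best = Math.min(best, row[i]);
        }
        if (node.term != null && row[row.length - 1] <= maxDist) {
            hits.add(node.term);
        }
        if (best > maxDist) {
            return; // early abandoning: no extension of this prefix can get back under maxDist
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            walk(e.getValue(), e.getKey(), query, row, maxDist, hits);
        }
    }
}
```

Only characters that actually occur in the dictionary are tried, which is the property the sorted terms dict cannot offer cheaply. A TST or a more compact trie would change the node layout but not the traversal.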



So much for the theory :) I hope you find these comments useful, even without
patches.



 



[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-11 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832424#action_12832424
 ] 

Eks Dev commented on LUCENE-2089:
-

{quote}
 What about this,
http://www.catalysoft.com/articles/StrikeAMatch.html
it seems logically more appropriate to (human-entered) text objects than 
Levenshtein distance, and it is (in theory) extremely fast; is DFA-distance 
faster? 
{quote}

Is it only me who sees plain, vanilla bigram distance here? What is new or
better in StrikeAMatch compared to the first phase of the current SpellChecker
(feeding the PriorityQueue with candidates)?

If you need to use this, nothing is simpler: you do not even need pair
comparison (aka traversal), just index terms split into bigrams and search with
a standard Query.
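
For reference, the bigram splitting and the pair-overlap score amount to something like the sketch below. It assumes a single term without whitespace handling or case folding, and the class name is made up for this example; in an index the bigrams would simply be terms of a separate field.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative bigram helper; the names are hypothetical.
final class Bigrams {

    /** Splits a term into its overlapping character pairs, e.g. "abc" gives "ab", "bc". */
    static List<String> of(String term) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= term.length(); i++) {
            grams.add(term.substring(i, i + 2));
        }
        return grams;
    }

    /** Dice coefficient over bigrams: 2 * shared pairs / total pairs. */
    static double dice(String a, String b) {
        List<String> left = of(a);
        List<String> right = of(b);
        int total = left.size() + right.size();
        if (total == 0) {
            return a.equals(b) ? 1.0 : 0.0; // terms too short to have any bigram
        }
        int shared = 0;
        for (String gram : left) {
            if (right.remove(gram)) { // remove so a repeated pair is matched only once
                shared++;
            }
        }
        return 2.0 * shared / total;
    }
}
```

A transposition such as johan/joahn destroys three of the four bigrams, which shows why this measure behaves quite differently from edit distance on short terms.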


The automaton trick is a neat one. Imo, the only thing that would work better
is to make the term dictionary a real trie (ternary, n-ary, DFA, makes no big
diff). Making TermDict some sort of trie/DFA would permit smart beam search,
even without compiling a query DFA. Beam search also makes implementation of
better distances possible (weighted edit distance without the metric
constraint). I guess this is going to be possible with Flex; Mike was already
talking about a DFA dictionary :)

It took a while to figure out the trick Robert pulled here, treating the term
dictionary as another DFA due to the sortedness. Nice.

 explore using automaton for fuzzyquery
 --

 Key: LUCENE-2089
 URL: https://issues.apache.org/jira/browse/LUCENE-2089
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor
 Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java


 Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
 itching to write that nasty algorithm)
 we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
 * up front, calculate the maximum required K edits needed to match the users 
 supplied float threshold.
 * for at least small common E up to some max K (1,2,3, etc) we should create 
 a DFA for each E. 
 if the required E is above our supported max, we use dumb mode at first (no 
 seeking, no DFA, just brute force like now).
 As the pq fills, we swap progressively lower DFAs into the enum, based upon 
 the lowest score in the pq.
 This should work well on avg, at high E, you will typically fill the pq very 
 quickly since you will match many terms. 
 This not only provides a mechanism to switch to more efficient DFAs during 
 enumeration, but also to switch from dumb mode to smart mode.
 i modified my wildcard benchmark to generate random fuzzy queries.
 * Pattern: 7N stands for NNN, etc.
 * AvgMS_DFA: this is the time spent creating the automaton (constructor)
 ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
 |7N|10|64.0|4155.9|38.6|20.3|
 |14N|10|0.0|2511.6|46.0|37.9| 
 |28N|10|0.0|2506.3|93.0|86.6|
 |56N|10|0.0|2524.5|304.4|298.5|
 as you can see, this prototype is no good yet, because it creates the DFA in 
 a slow way. right now it creates an NFA, and all this wasted time is in 
 NFA-DFA conversion.
 So, for a very long string, it just gets worse and worse. This has nothing to 
 do with lucene, and here you can see, the TermEnum is fast (AvgMS - 
 AvgMS_DFA), there is no problem there.
 instead we should just build a DFA to begin with, maybe with this paper: 
 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
 we can precompute the tables with that algorithm up to some reasonable K, and 
 then I think we are ok.
 the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
 linear minimization, if someone wants to implement this they should not worry 
 about minimization.
 in fact, we need to at some point determine if AutomatonQuery should even 
 minimize FSM's at all, or if it is simply enough for them to be deterministic 
 with no transitions to dead states. (The only code that actually assumes 
 minimal DFA is the Dumb vs Smart heuristic and this can be rewritten as a 
 summation easily). we need to benchmark really complex DFAs (i.e. write a 
 regex benchmark) to figure out if minimization is even helping right now.


[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-11 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832741#action_12832741
]

Eks Dev commented on LUCENE-2089:
-

{quote}
I assume you mean by weighted edit distance that the transitions in the state
machine would have costs?
{quote}

Yes, kind of, not embedded in the trie, just defined externally.

What I am talking about is a part of the noisy channel approach, modeling only
the channel distribution. Have a look at http://norvig.com/spell-correct.html
for the basic theory. I am suggesting almost the same, just applied at the
character level and without the language model part. It is rather easy once you
have your dictionary in some sort of tree structure.

You guide your traversal over the trie by iterating on each char in your
search term, accumulating log probabilities of single transformations
(recycling the prefix part). When you hit a leaf, insert it into a PriorityQueue
of appropriate depth. What I mean by probabilities of single transformations is
defined as:
insertion(character a) // map char -> log probability (think of it as a kind of
cost of inserting this particular character)
deletion(character) // map char -> log probability...
transposition(char a, char b)
replacement(char a, char b) // 2D matrix char,char -> probability (cost)
If you wish, you could even add some positional information, boosting matches
at the start/end of the string.

I avoided the tricky mechanics of traversal, insertion, deletion, but on a trie
you can do it by following different paths...
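
To make the four cost tables concrete, here is a sketch of a weighted edit distance with externally defined costs. It deliberately compares two plain strings with a dynamic programming table instead of walking a trie, and it uses the restricted (adjacent swap) form of transposition; the interface, the class and the helper unitExcept are hypothetical names for this example. Costs are meant as non-negative values such as negative log probabilities.

```java
// Illustrative cost model, defined outside the dictionary structure; names are hypothetical.
interface EditCosts {
    double insert(char c);
    double delete(char c);
    double replace(char from, char to);
    double transpose(char first, char second);
}

final class WeightedEditDistance {

    /** Every operation costs 1, except inserting or deleting cheapChar, which costs cheapCost. */
    static EditCosts unitExcept(char cheapChar, double cheapCost) {
        return new EditCosts() {
            public double insert(char c) { return c == cheapChar ? cheapCost : 1.0; }
            public double delete(char c) { return c == cheapChar ? cheapCost : 1.0; }
            public double replace(char from, char to) { return 1.0; }
            public double transpose(char first, char second) { return 1.0; }
        };
    }

    /** Cheapest sequence of edits turning source into target under the given costs. */
    static double distance(String source, String target, EditCosts costs) {
        int n = source.length();
        int m = target.length();
        double[][] d = new double[n + 1][m + 1];
        for (int i = 1; i <= n; i++) {
            d[i][0] = d[i - 1][0] + costs.delete(source.charAt(i - 1));
        }
        for (int j = 1; j <= m; j++) {
            d[0][j] = d[0][j - 1] + costs.insert(target.charAt(j - 1));
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                char a = source.charAt(i - 1);
                char b = target.charAt(j - 1);
                double best = d[i - 1][j - 1] + (a == b ? 0.0 : costs.replace(a, b));
                best = Math.min(best, d[i - 1][j] + costs.delete(a));
                best = Math.min(best, d[i][j - 1] + costs.insert(b));
                // Adjacent swap: the last two chars of both prefixes are each other's mirror.
                if (i > 1 && j > 1 && a == target.charAt(j - 2) && source.charAt(i - 2) == b) {
                    best = Math.min(best, d[i - 2][j - 2] + costs.transpose(source.charAt(i - 2), a));
                }
                d[i][j] = best;
            }
        }
        return d[n][m];
    }
}
```

With unitExcept('h', 1.0) this behaves like a standard unit-cost distance with transpositions; lowering the cost for 'h' makes Thomas and Tomas nearly identical. On a trie, each row of the table would be computed once per prefix instead of once per term.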

The only good implementation (in memory) I know of is in the LingPipe
spell checker (they implement the full noisy channel, with the language model
driving traversal)... It has huge educational value; Bob is really great at
explaining things. The code itself is proprietary.
I would suggest you peek into this code to see the 2-minute rambling I
wrote here properly explained :) Just ignore the language model part and assume
you have a NULL language model (all chars in the language are equally
probable), doing a full traversal over the trie.

{quote}
If this is the case couldn't we even define standard levenshtein very easily
(instead of nasty math), and would the beam search technique enumerate
efficiently for us?
{quote}
Standard Lev. is trivially configured once you have this; it is just setting
all these costs to 1 (delete, insert... in the log domain)... But who would use
the standard distance with such a beast, when you can reduce the impact of
inserting/deleting a silent h as in Thomas/Tomas...
Enumeration is trie traversal, practically calculating the distance against all
terms at the same time and collecting the N best along the way. The place where
you save time is recycling the prefix part in this calculation. Enumeration is
optimal, as the trie contains only the terms from TermDict: you are not
trying all possible alphabet characters, and you can implement early path
abandoning easily, either by cost (log probability) and/or by limiting the
number of successive insertions.

If interested in really in depth things, look at
http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198
Great book, (another great tip from b...@lingpipe). A bit strange with
terminology (at least to me), but once you get used to it, is really worth the
time you spend trying to grasp it.


[jira] Commented: (LUCENE-1410) PFOR implementation

2009-10-06 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762742#action_12762742
]

Eks Dev commented on LUCENE-1410:
-

Mike,
That is definitely the way to go, distribution dependent encoding, where every
Term gets individual treatment.

Take for example a simple, but not all that rare, case where the index gets
sorted on some of the indexed fields (we use it really extensively, e.g. a
presorted doc collection on user_rights/zip/city, all indexed). There you get
perfectly compressible postings by simply managing intervals of set bits.
Updates distort this picture, but we rebuild the index periodically and all
gets good again. At the moment we load them into RAM as Filters in
IntervalSets. If that were possible in Lucene, we wouldn't bother with Filters
(VInt decoding on such super dense fields was killing us, even in
RAMDirectory)...
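
As an illustration of "managing intervals of set bits", the sketch below turns a sorted list of doc ids into runs and answers membership with a binary search. It is a guess at the general shape of such a structure, not our production IntervalSets and not a Lucene class; the names are hypothetical and doc ids are assumed to be non-negative and sorted ascending.

```java
import java.util.Arrays;

// Illustrative run-based doc id set; names are hypothetical.
final class IntervalSet {
    private final int[] starts; // first doc id of each run, ascending
    private final int[] ends;   // last doc id of each run, inclusive

    private IntervalSet(int[] starts, int[] ends) {
        this.starts = starts;
        this.ends = ends;
    }

    /** Builds the runs from doc ids sorted in ascending order (duplicates are tolerated). */
    static IntervalSet fromSortedDocIds(int[] docIds) {
        int[] s = new int[docIds.length];
        int[] e = new int[docIds.length];
        int n = 0;
        for (int doc : docIds) {
            if (n > 0 && doc - e[n - 1] <= 1) {
                e[n - 1] = doc; // adjacent or duplicate id: extend the current run
            } else {
                s[n] = doc;     // gap: open a new run
                e[n] = doc;
                n++;
            }
        }
        return new IntervalSet(Arrays.copyOf(s, n), Arrays.copyOf(e, n));
    }

    boolean contains(int doc) {
        int idx = Arrays.binarySearch(starts, doc);
        if (idx >= 0) {
            return true; // doc is exactly the start of a run
        }
        int insertion = -idx - 1; // index of the first run starting after doc
        return insertion > 0 && doc <= ends[insertion - 1];
    }

    int intervalCount() {
        return starts.length;
    }
}
```

On an index presorted by the field, a term matching a million consecutive documents costs two ints here, which is why the VInt-per-document decoding looked so wasteful in comparison.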

Thinking about your comments, isn't pulsing somewhat orthogonal to the packing
method? For example, if you load the index into a RAMDirectory, one could avoid
one indirection level and inline all postings.

Flex Indexing rocks, that is going to be the most important addition to lucene
since it started (imo)... I would even bet on double search speed in first
attempt for average queries :)

Cheers,
eks

PFOR implementation
---

Key: LUCENE-1410
URL: https://issues.apache.org/jira/browse/LUCENE-1410
Project: Lucene - Java
Issue Type: New Feature
Components: Other
Reporter: Paul Elschot
Priority: Minor
Attachments: autogen.tgz, LUCENE-1410-codecs.tar.bz2,
LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch,
LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java,
TestPFor2.java

Original Estimate: 21840h
Remaining Estimate: 21840h

Implementation of Patched Frame of Reference.


[jira] Commented: (LUCENE-1762) Slightly more readable code in TermAttributeImpl

2009-07-27 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735809#action_12735809
 ] 

Eks Dev commented on LUCENE-1762:
-

cool, thanks for the review.   

 Slightly more readable code in TermAttributeImpl 
 -

 Key: LUCENE-1762
 URL: https://issues.apache.org/jira/browse/LUCENE-1762
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Eks Dev
Assignee: Uwe Schindler
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1762.patch, LUCENE-1762.patch, LUCENE-1762.patch, 
 LUCENE-1762.patch


 No big deal. 
 growTermBuffer(int newSize) was using correct, but slightly hard to follow
 code.
 The method returned either null, as a hint to the upstream code that the
 current termBuffer has enough space, or the reallocated buffer.
 This patch simplifies the logic, making this method only reallocate the
 buffer, nothing more.
 It reduces the number of if(null) checks in a few methods and reduces the
 amount of code.
 All tests pass.


[jira] Created: (LUCENE-1762) Slightly more readable code in TermAttributeImpl

2009-07-25 Thread Eks Dev (JIRA)

Slightly more readable code in TermAttributeImpl 
-

 Key: LUCENE-1762
 URL: https://issues.apache.org/jira/browse/LUCENE-1762
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial


No big deal. 

growTermBuffer(int newSize) was using correct, but slightly hard to follow
code.

The method returned either null, as a hint to the upstream code that the
current termBuffer has enough space, or the reallocated buffer.

This patch simplifies the logic, making this method only reallocate the
buffer, nothing more.
It reduces the number of if(null) checks in a few methods and reduces the
amount of code.
All tests pass.


[jira] Updated: (LUCENE-1762) Slightly more readable code in TermAttributeImpl

2009-07-25 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1762:


Attachment: LUCENE-1762.patch

 Slightly more readable code in TermAttributeImpl 
 -

 Key: LUCENE-1762
 URL: https://issues.apache.org/jira/browse/LUCENE-1762
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Attachments: LUCENE-1762.patch




[jira] Updated: (LUCENE-1762) Slightly more readable code in TermAttributeImpl

2009-07-25 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1762:


Attachment: LUCENE-1762.patch

made the changes in Token along the same lines, 

- had to change one constant in TokenTest as I have changed the initial
allocation policy of termBuffer to be consistent with ArrayUtil.getNextSize()

if(termBuffer==null)

NEW:
 termBuffer = new char[ArrayUtil.getNextSize(newSize < MIN_BUFFER_SIZE ? 
MIN_BUFFER_SIZE : newSize)]; 

OLD:
termBuffer = new char[newSize < MIN_BUFFER_SIZE ? MIN_BUFFER_SIZE : newSize]; 

not sure if this is better, but it looks more consistent to me (buffer size is 
always determined via getNextSize())

Uwe, 
setOnlyUseNewAPI(false) does not exist, it was removed with some of the patches 
lately. It gets automatically detected via reflection?



 Slightly more readable code in TermAttributeImpl 
 -

 Key: LUCENE-1762
 URL: https://issues.apache.org/jira/browse/LUCENE-1762
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Assignee: Uwe Schindler
Priority: Trivial
 Attachments: LUCENE-1762.patch, LUCENE-1762.patch




[jira] Updated: (LUCENE-1762) Slightly more readable code in TermAttributeImpl

2009-07-25 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Eks Dev updated LUCENE-1762:

Attachment: LUCENE-1762.patch

- made allocation in initTermBuffer() consistent with
ArrayUtil.getNextSize(int) - this is ok not to start with MIN_BUFFER_SIZE, but
rather with ArrayUtil.getNextSize(MIN_BUFFER_SIZE)... e.g. if getNextSize gets
very sensitive to initial conditions one day...

- null-ed termText on switch to termBuffer in resizeTermBuffer (as it was
before!) . This was a bug in previous patch

Slightly more readable code in TermAttributeImpl
-

Key: LUCENE-1762
URL: https://issues.apache.org/jira/browse/LUCENE-1762
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Reporter: Eks Dev
Assignee: Uwe Schindler
Priority: Trivial
Attachments: LUCENE-1762.patch, LUCENE-1762.patch, LUCENE-1762.patch



[jira] Commented: (LUCENE-1743) MMapDirectory should only mmap large files, small files should be opened using SimpleFS/NIOFS

2009-07-14 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731085#action_12731085
 ] 

Eks Dev commented on LUCENE-1743:
-

indeed! obvious idea, 

the only thing I do not like about it is making these hidden, deceptive
decisions: I said I want MMapDirectory and someone else decided something else
for me... It does not matter if we have consensus here now, it may change
tomorrow.

Probably a better way would be to turbo charge FileSwitchDirectory with sexy
parametrization options:
MMapDirectory - F(fileExtension, minSize, maxSize) // If the file has this
fileExtension and its size is less than maxSize and greater than minSize, then
open the file with MMapDirectory... then go on to the next rule... (can be
designed upside down as well... changes nothing in the idea)

The same for RAMDir, NIO, FS...

With this, we can make UwesBestOfMMapDirectoryFor32BitOSs (your proposal here)
or
HighlyConcurentForWindows64WithTermDictionaryInRamAndStoredFieldsOnDiskDirectory
just for me :)

So most end users take the smart defaults we provide in core, and the
freaks (Expert users in official lingo :) have an easy job, they just
configure TurboChargedFileSwitchDirectory.

It should be easy to come up with a clean design for these concrete Directory
selection rules while keeping the concrete Directories pure.
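
The rule list could be as small as the sketch below. It is only a shape for the discussion: DirectoryRules, Backend and the rule parameters are hypothetical names, the code does not touch any real Directory implementation, and the first matching rule wins, with a fallback when none matches.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative rule table for picking a Directory flavour per file; all names are hypothetical.
final class DirectoryRules {
    enum Backend { MMAP, NIO, SIMPLE_FS, RAM }

    private static final class Rule {
        final String extension;
        final long minSize;
        final long maxSize;
        final Backend backend;

        Rule(String extension, long minSize, long maxSize, Backend backend) {
            this.extension = extension;
            this.minSize = minSize;
            this.maxSize = maxSize;
            this.backend = backend;
        }

        boolean matches(String fileName, long size) {
            return fileName.endsWith("." + extension) && size >= minSize && size <= maxSize;
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private final Backend fallback;

    DirectoryRules(Backend fallback) {
        this.fallback = fallback;
    }

    /** Adds a rule; rules are evaluated in the order they were added. */
    DirectoryRules rule(String extension, long minSize, long maxSize, Backend backend) {
        rules.add(new Rule(extension, minSize, maxSize, backend));
        return this;
    }

    /** Returns the backend of the first matching rule, or the fallback. */
    Backend select(String fileName, long size) {
        for (Rule rule : rules) {
            if (rule.matches(fileName, size)) {
                return rule.backend;
            }
        }
        return fallback;
    }
}
```

A configuration such as "term dictionary in RAM, large stored-fields files mmapped, everything else NIO" is then three lines of setup, and the concrete Directory classes stay unaware of the selection logic.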

Cheers, Eks 




 MMapDirectory should only mmap large files, small files should be opened 
 using SimpleFS/NIOFS
 -

 Key: LUCENE-1743
 URL: https://issues.apache.org/jira/browse/LUCENE-1743
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1


 This is a followup to LUCENE-1741:
 Javadocs state (in FileChannel#map): For most operating systems, mapping a 
 file into memory is more expensive than reading or writing a few tens of 
 kilobytes of data via the usual read and write methods. From the standpoint 
 of performance it is generally only worth mapping relatively large files into 
 memory.
 MMapDirectory should get a user-configurable size parameter that is a lower 
 limit for mmapping files. All files with a size < limit should be opened using 
 a conventional IndexInput from SimpleFS or NIO (another configuration option 
 for the fallback?).


[jira] Commented: (LUCENE-1743) MMapDirectory should only mmap large files, small files should be opened using SimpleFS/NIOFS

2009-07-14 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731104#action_12731104
 ] 

Eks Dev commented on LUCENE-1743:
-

Right, it is not all about reading the index, you have to write it as well...

Why not make it an abstract class with
abstract Directory getDirectory(String file, int minSize, int maxSize, String 
[read/write/append], String context);
String getName(); // for logging
   
What do you mean by context? Something along the lines of /Give me a
directory for segment merges, read-only for search./
...Maybe one day we will have the possibility not to kill the OS cache by
merging...



 MMapDirectory should only mmap large files, small files should be opened 
 using SimpleFS/NIOFS
 -

 Key: LUCENE-1743
 URL: https://issues.apache.org/jira/browse/LUCENE-1743
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1




[jira] Commented: (LUCENE-1741) Make MMapDirectory.MAX_BBUF user configureable to support chunking the index files in smaller parts

2009-07-13 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730560#action_12730560
]

Eks Dev commented on LUCENE-1741:
-

Uwe, you convinced me, I looked at the code, and indeed, no performance penalty
for this.

What helped me was 1.1G... (I tried to find the maximum); max file size is
1.4G... but 1.1 is just an OS coincidence, no magic about it.

I guess 512 MB makes a good value; if memory is so fragmented that you cannot
allocate 0.5G, you definitely have some other problems around. We are
talking here about VM memory, and even on Windows having 512 MB in one block is
not an issue (or better said, I have never seen problems with this value).

@Paul: It is a misunderstanding, my algorithm was meant to be manual... no
catching OOM and retrying (I've burned my fingers already on catching
RuntimeException, do it only when absolutely desperate :). Uwe made this value
user settable anyhow.

Thanks Uwe!

Make MMapDirectory.MAX_BBUF user configureable to support chunking the index
files in smaller parts
---

Key: LUCENE-1741
URL: https://issues.apache.org/jira/browse/LUCENE-1741
Project: Lucene - Java
Issue Type: Improvement
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Priority: Minor
Fix For: 2.9

Attachments: LUCENE-1741.patch, LUCENE-1741.patch

This is a followup for java-user thread:
http://www.lucidimagination.com/search/document/9ba9137bb5d8cb78/oom_with_2_9#9bf3b5b8f3b1fb9b
It is easy to implement, just add a setter method for this parameter to
MMapDir.


[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-06-29 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725168#action_12725168
 ] 

Eks Dev commented on LUCENE-1720:
-

It may be late for this issue, but maybe worth thinking about. We could change 
the semantics of this problem completely. Imo, the problem can be reformulated as 
"Provide the possibility to cancel running queries on a best-effort basis, with or 
without returning the results collected so far".

That would leave timer management to the end users and let this issue focus on 
one Lucene core concern ... Timeout management can then be provided as an example 
somewhere, "How to implement timeout management using ..."
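
A rough sketch of what best-effort cancellation could look like, independent of any timer. This is not the API from the attached patch; CancellableScan, cancel and collect are hypothetical names, and the "search" is just a loop over candidate doc ids that checks a flag before each hit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntConsumer;

/** Hypothetical illustration: a scan that can be cancelled from outside. */
public class CancellableScan {
    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    /** May be called from any thread, e.g. from a user-managed timer. */
    public void cancel() {
        cancelled.set(true);
    }

    /**
     * Collects doc ids until the input is exhausted or cancel() was called.
     * Returns whatever was collected so far (partial results on cancel).
     */
    public List<Integer> collect(int[] candidateDocIds, IntConsumer onHit) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : candidateDocIds) {
            if (cancelled.get()) {
                break; // best effort: stop at the next check point
            }
            hits.add(doc);
            onHit.accept(doc);
        }
        return hits;
    }
}
```

A timeout then becomes one possible user of cancel(): the end user schedules a task that calls it after N milliseconds, and the core only needs to honour the flag.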








 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.


[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2009-06-29 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725182#action_12725182
 ] 

Eks Dev commented on LUCENE-1720:
-

Sure, I just wanted to sharpen the definition of what is a Lucene core issue and what 
we can leave to end users. It is not only about time, but rather about 
canceling search requests (or even better, general activities). 



[jira] Commented: (LUCENE-1594) Use source code specialization to maximize search performance

2009-05-07 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12707116#action_12707116
 ] 

Eks Dev commented on LUCENE-1594:
-

Huh, it reduces hardware costs 2-3 times for larger setups! Great

 Use source code specialization to maximize search performance
 -

 Key: LUCENE-1594
 URL: https://issues.apache.org/jira/browse/LUCENE-1594
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: FastSearchTask.java, LUCENE-1594.patch, 
 LUCENE-1594.patch, LUCENE-1594.patch


 Towards eking out the absolute best search performance, and after seeing the
 Java ghosts in LUCENE-1575, I decided to build a simple prototype
 source code specializer for Lucene's searches.
 The idea is to write dynamic Java code, specialized to run a very
 specific query context (eg TermQuery, collecting top N by field, no
 filter, no deletions), compile that Java code, and run it.
 Here're the performance gains when compared to trunk:
 ||Query||Sort||Filt||Deletes||Scoring||Hits||QPS (base)||QPS (new)||%||
 |1|Date (long)|no|no|Track,Max|2561886|6.8|10.6|{color:green}55.9%{color}|
 |1|Date (long)|no|5%|Track,Max|2433472|6.3|10.5|{color:green}66.7%{color}|
 |1|Date (long)|25%|no|Track,Max|640022|5.2|9.9|{color:green}90.4%{color}|
 |1|Date (long)|25%|5%|Track,Max|607949|5.3|10.3|{color:green}94.3%{color}|
 |1|Date (long)|10%|no|Track,Max|256300|6.7|12.3|{color:green}83.6%{color}|
 |1|Date (long)|10%|5%|Track,Max|243317|6.6|12.6|{color:green}90.9%{color}|
 |1|Relevance|no|no|Track,Max|2561886|11.2|17.3|{color:green}54.5%{color}|
 |1|Relevance|no|5%|Track,Max|2433472|10.1|15.7|{color:green}55.4%{color}|
 |1|Relevance|25%|no|Track,Max|640022|6.1|14.1|{color:green}131.1%{color}|
 |1|Relevance|25%|5%|Track,Max|607949|6.2|14.4|{color:green}132.3%{color}|
 |1|Relevance|10%|no|Track,Max|256300|7.7|15.6|{color:green}102.6%{color}|
 |1|Relevance|10%|5%|Track,Max|243317|7.6|15.9|{color:green}109.2%{color}|
 |1|Title (string)|no|no|Track,Max|2561886|7.8|12.5|{color:green}60.3%{color}|
 |1|Title (string)|no|5%|Track,Max|2433472|7.5|11.1|{color:green}48.0%{color}|
 |1|Title (string)|25%|no|Track,Max|640022|5.7|11.2|{color:green}96.5%{color}|
 |1|Title (string)|25%|5%|Track,Max|607949|5.5|11.3|{color:green}105.5%{color}|
 |1|Title (string)|10%|no|Track,Max|256300|7.0|12.7|{color:green}81.4%{color}|
 |1|Title (string)|10%|5%|Track,Max|243317|6.7|13.2|{color:green}97.0%{color}|
 Those tests were run on a 19M doc wikipedia index (splitting each
 Wikipedia doc @ ~1024 chars), on Linux, Java 1.6.0_10
 But: it only works with TermQuery for now; it's just a start.
 It should be easy for others to run this test:
   * apply patch
   * cd contrib/benchmark
   * run python -u bench.py -delindex /path/to/index/with/deletes
 -nodelindex /path/to/index/without/deletes
 (You can leave off one of -delindex or -nodelindex and it'll skip
 those tests).
 For each test, bench.py generates a single Java source file that runs
 that one query; you can open
 contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/FastSearchTask.java
 to see it.  I'll attach an example.  It writes results.txt, in Jira
 table format, which you should be able to copy/paste back here.
 The specializer uses pretty much every search speedup I can think of
 -- the ones from LUCENE-1575 (to score or not, to maxScore or not),
 the ones suggested in the spinoff LUCENE-1593 (pre-fill w/ sentinels,
 don't use docID for tie breaking), LUCENE-1536 (random access
 filters).  It bypasses TermDocs and interacts directly with the
 IndexInput, and with BitVector for deletions.  It directly folds in
 the collector, if possible.  A filter if used must be random access,
 and is assumed to pre-multiply-in the deleted docs.
 Current status:
   * I only handle TermQuery.  I'd like to add others over time...
   * It can collect by score, or single field (with the 3 scoring
 options in LUCENE-1575).  It can't do reverse field sort nor
 multi-field sort now.
   * The auto-gen code (gen.py) is rather hideous.  It could use some
 serious refactoring, etc.; I think we could get it to the point
 where each Query can gen its own specialized code, maybe.  It also
 needs to be eventually ported to Java.
   * The script runs old, then new, then checks that the topN results
 are identical, and aborts if not.  So I'm pretty sure the
 specialized code is working correctly, for the cases I'm testing.
   * The patch includes a few small changes to core, mostly to open up
 package protected APIs so I can access stuff
 I think this is an interesting effort for several reasons:
   * It gives us a best-case upper bound

[jira] Commented: (LUCENE-1518) Merge Query and Filter classes

2009-04-30 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704561#action_12704561
]

Eks Dev commented on LUCENE-1518:
-

Imo, it is really not all that important to make Filter and Query the same
(that is just one alternative to achieve the goal).

The basic problem we are trying to solve is adding a Filter directly to BooleanQuery, and
making optimizations after that easier. Wrapping with CSQ just adds another
layer between the Lucene search machinery and the Filter, making these optimizations
harder.

On the other hand, I must accept that conceptually Filter and Query are the same,
together supporting the following options:
1. Pure boolean model: you do not care about scores (today we can do it only
via CSQ, as Filter does not enter BooleanQuery)
2. Mixed boolean and ranked: you have to define the Filter's contribution to the
documents (CSQ)
3. Pure ranked: no filters, everything gets scored (the same as 2.)

Ideally, as a user, I define only a Query (Filter-based or not) and for each
clause in my Query define
Query.setScored(true/false) or useConstantScore(double score);

Also, I should be able to say, "Dear Lucene, please materialize this
Query_Filter for me, as I would like to have it cached, and please store only
DocIds" (Filter today). Maybe also open the possibility to cache the
scores of the documents as well.

One thing is the concept and another is optimization. From the optimization point of
view, we have a couple of decisions to make:

- DocID Set supports random access, yes or no (my Materialized Query)
- Decide if a clause should / should not be scored, or should be constant

So, for each Query we need to decide/support:

- scoring {yes, no, constant} and
- an option to materialize the Query (that is how we create Filters
today)
- these Materialized Queries (aka Filters) should be able to tell us if they
support random access, and if they cache only doc ids or scores as well
Nothing useful in this email, just thinking aloud, sometimes it helps :)

Merge Query and Filter classes
--

Key: LUCENE-1518
URL: https://issues.apache.org/jira/browse/LUCENE-1518
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.4
Reporter: Uwe Schindler
Fix For: 2.9

Attachments: LUCENE-1518.patch

This issue presents a patch, that merges Queries and Filters in a way, that
the new Filter class extends Query. This would make it possible, to use every
filter as a query.
The new abstract filter class would contain all methods of
ConstantScoreQuery, deprecate ConstantScoreQuery. If somebody implements the
Filter's getDocIdSet()/bits() methods he has nothing more to do, he could
just use the filter as a normal query.
I do not want to completely convert Filters to ConstantScoreQueries. The idea
is to combine Queries and Filters in such a way, that every Filter can
automatically be used at all places where a Query can be used (e.g. also
alone a search query without any other constraint). For that, the abstract
Query methods must be implemented and return a default weight for Filters
which is the current ConstantScore Logic. If the filter is used as a real
filter (where the API wants a Filter), the getDocIdSet part could be directly
used, the weight is useless (as it is currently, too). The constant score
default implementation is only used when the Filter is used as a Query (e.g.
as direct parameter to Searcher.search()). For the special case of
BooleanQueries combining Filters and Queries the idea is, to optimize the
BooleanQuery logic in such a way, that it detects if a BooleanClause is a
Filter (using instanceof) and then directly uses the Filter API and not take
the burden of the ConstantScoreQuery (see LUCENE-1345).
Here some ideas how to implement Searcher.search() with Query and Filter:
- User runs Searcher.search() using a Filter as the only parameter. As every
Filter is also a ConstantScoreQuery, the query can be executed and returns
score 1.0 for all matching documents.
- User runs Searcher.search() using a Query as the only parameter: No change,
all is the same as before
- User runs Searcher.search() using a BooleanQuery as parameter: If the
BooleanQuery does not contain a Query that is subclass of Filter (the new
Filter) everything as usual. If the BooleanQuery only contains exactly one
Filter and nothing else the Filter is used as a constant score query. If
BooleanQuery contains clauses with Queries and Filters the new algorithm
could be used: The queries are executed and the results filtered with the
filters.
For the user this has the main advantage that he can construct his query
using a simplified API without thinking about Filters or Queries; you can
just combine clauses

[jira] Commented: (LUCENE-1518) Merge Query and Filter classes

2009-04-30 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704613#action_12704613
]

Eks Dev commented on LUCENE-1518:
-

Shai,
Regarding pure ranked, CSQ is really what we need, no? ---

Yep, it would work for Filters, but why not make it possible to have a normal
Query with constant score? For these cases, I am just not sure if this approach
gets max performance (I did not look at this code for quite a while).

Imagine you have a Query and you are not interested in scoring at all; this can
be accomplished with only DocID iterator arithmetic, ignoring score() totally.
But that is only an optimization (maybe already there?)
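
As an illustration of "DocID iterator arithmetic" with no scoring involved, here is a minimal conjunction over two sorted doc id lists. It uses plain int arrays rather than Lucene's iterators, and DocIdConjunction is a made-up name.

```java
import java.util.Arrays;

/** Hypothetical illustration: AND of two sorted doc id lists, no scores computed. */
public class DocIdConjunction {

    /** Both inputs must be sorted ascending without duplicates. */
    public static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;            // advance the side that is behind
            } else if (a[i] > b[j]) {
                j++;
            } else {
                out[n++] = a[i]; // doc matches both clauses
                i++;
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }
}
```

Nothing in the loop ever asks for a score, which is the whole optimization: matching is decided by doc id comparisons alone.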

Paul,
How about materializing the DocIds _and_ the score values?
Exactly, that would open the full caching possibility (the original purpose of
Filters). Think search results caching ... that is practically another name
for the search() method. It is easy to create this, but using it again would
require some bigger changes :)

Filter_on_Steroids materialize(boolean without_score);



[jira] Commented: (LUCENE-1518) Merge Query and Filter classes

2009-04-30 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704618#action_12704618
]

Eks Dev commented on LUCENE-1518:
-

Paul: ...The current patch at LUCENE-1345 does not need such a FilterWeight;
the no scoring case is handled by not asking for score values...

Me: ...Imagine you have a Query and you are not interested in Scoring at all,
this can be accomplished with only DocID iterator arithmetic, ignoring score()
totally. But that is only an optimization (maybe already there?)...

I knew Paul would kick in at this place; he said exactly the same thing I did,
but, as opposed to me, he made a formulation that executes :)
Pfff, I feel bad :)



[jira] Commented: (LUCENE-1619) TermAttribute.termLength() optimization

2009-04-28 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703543#action_12703543
 ] 

Eks Dev commented on LUCENE-1619:
-

thanks Mike

 TermAttribute.termLength() optimization
 ---

 Key: LUCENE-1619
 URL: https://issues.apache.org/jira/browse/LUCENE-1619
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1619.patch


public int termLength() {
  initTermBuffer(); // This patch removes this method call 
  return termLength;
}
 I see no reason to initTermBuffer() in termLength()... all tests pass, but I 
 could be wrong?


[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703085#action_12703085
 ] 

Eks Dev commented on LUCENE-1616:
-

I am ok with both options; removing the separate setters looks a bit better to me, as it 
forces users to think atomically about offset = {start, end}. 

If you separate start and end offset too far in your code, the probability that you 
do not see a mistake somewhere is higher compared to the case where you manage 
start and end on your own in these cases (this is then rather explicit in your 
code)... 
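
A small sketch of the "one setter for both" idea. This is not the OffsetAttribute from the patch; SimpleOffset is a stand-in class, and the range check is an assumption added here to show what a combined setter makes possible, not something the patch is known to do.

```java
/** Hypothetical stand-in for an offset holder with a single combined setter. */
public class SimpleOffset {
    private int startOffset;
    private int endOffset;

    /**
     * Sets both offsets in one call, so the pair is always updated together.
     * The validation is illustrative only.
     */
    public void setOffset(int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset < startOffset) {
            throw new IllegalArgumentException(
                "invalid offsets: start=" + startOffset + ", end=" + endOffset);
        }
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    public int startOffset() { return startOffset; }

    public int endOffset() { return endOffset; }
}
```

With two separate setters such a check has no natural home, because between the two calls the pair is temporarily inconsistent.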

But that is really something we should not think too much about :) We 
make no mistakes either way
 
I can provide a new patch, if needed. 

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it


[jira] Updated: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1616:


Attachment: LUCENE-1616.patch

whoops, this time it compiles :)

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it


[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703254#action_12703254
 ] 

Eks Dev commented on LUCENE-1616:
-

me too, sorry! 
Eclipse left me blind for some funny reason
waiting for test to complete before I commit again ... 



[jira] Updated: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1616:


Attachment: LUCENE-1616.patch

ok, maybe this time it will work, I hope I managed to clean it up (core build 
and test pass). 

The only thing that fails is contrib, but I guess this has nothing to do with 
it? 


[javac] D:\Repository\SerachAndMatch\Lucene\lucene\java\trunk\contrib\highlighter\src\java\org\apache\lucene\search\highlight\WeightedSpanTermExtractor.java:306: cannot find symbol
[javac]   MemoryIndex indexer = new MemoryIndex();
[javac]   ^
[javac]   symbol:   class MemoryIndex
[javac]   location: class org.apache.lucene.search.highlight.WeightedSpanTermExtractor
[javac] D:\Repository\SerachAndMatch\Lucene\lucene\java\trunk\contrib\highlighter\src\java\org\apache\lucene\search\highlight\WeightedSpanTermExtractor.java:306: cannot find symbol
[javac]   MemoryIndex indexer = new MemoryIndex();
[javac] ^
[javac]   symbol:   class MemoryIndex
[javac]   location: class org.apache.lucene.search.highlight.WeightedSpanTermExtractor
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 3 errors

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch, 
 LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it


[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703335#action_12703335
 ] 

Eks Dev commented on LUCENE-1616:
-

ant build-contrib 



[jira] Updated: (LUCENE-1619) TermAttribute.termLength() optimization

2009-04-27 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1619:


Attachment: LUCENE-1619.patch

 TermAttribute.termLength() optimization
 ---

 Key: LUCENE-1619
 URL: https://issues.apache.org/jira/browse/LUCENE-1619
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Attachments: LUCENE-1619.patch


public int termLength() {
  initTermBuffer(); // This patch removes this method call 
  return termLength;
}
 I see no reason to initTermBuffer() in termLength()... all tests pass, but I 
 could be wrong?


[jira] Created: (LUCENE-1619) TermAttribute.termLength() optimization

2009-04-27 Thread Eks Dev (JIRA)

TermAttribute.termLength() optimization
---

 Key: LUCENE-1619
 URL: https://issues.apache.org/jira/browse/LUCENE-1619
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Attachments: LUCENE-1619.patch


public int termLength() {
  initTermBuffer(); // This patch removes this method call 
  return termLength;
}

I see no reason to initTermBuffer() in termLength()... all tests pass, but I 
could be wrong?




[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-27 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703406#action_12703406
 ] 

Eks Dev commented on LUCENE-1618:
-

Maybe 
FileSwitchDirectory should have the possibility to get a list of files/extensions that 
should be loaded into RAM... making it maintenance free, pushing this decision 
to the end user... if and when we decide to support users in it, we could then 
maintain a static list at a separate place. Kind of separating execution and 
configuration
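
The "separate execution from configuration" idea could be sketched like this. It is not the FileSwitchDirectory under discussion; ExtensionRouter and its methods are hypothetical, and the extensions used in the usage note are only example values supplied by the caller.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration: routing decision driven by a user-supplied extension list. */
public class ExtensionRouter {
    private final Set<String> ramExtensions;

    /** The caller decides which extensions go to RAM; nothing is hard-coded here. */
    public ExtensionRouter(Set<String> ramExtensions) {
        this.ramExtensions = new HashSet<>(ramExtensions);
    }

    /** Returns the text after the last dot, or "" if the name has no dot. */
    public static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1);
    }

    /** True if the file should be served from the RAM-backed directory. */
    public boolean loadIntoRam(String fileName) {
        return ramExtensions.contains(extensionOf(fileName));
    }
}
```

For example, a router built with the set {"tis", "tii"} would send "_0.tis" to RAM and "_0.fdt" to disk, and a default list could later be shipped as plain configuration without touching this class.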

I *think* I saw something similar that Ning Lee made quite a while ago, from the hadoop 
camp (indexing on hadoop something...). But I cannot remember what it was :(


  

 Allow setting the IndexWriter docstore to be a different directory
 --

 Key: LUCENE-1618
 URL: https://issues.apache.org/jira/browse/LUCENE-1618
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

   Original Estimate: 336h
  Remaining Estimate: 336h

 Add an IndexWriter.setDocStoreDirectory method that allows doc
 stores to be placed in a different directory than the IW default
 dir.


[jira] Created: (LUCENE-1615) deprecated method used in fieldsReader / setOmitTf()

2009-04-26 Thread Eks Dev (JIRA)

deprecated method used in fieldsReader / setOmitTf()


 Key: LUCENE-1615
 URL: https://issues.apache.org/jira/browse/LUCENE-1615
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Eks Dev
Priority: Trivial


setOmitTf(boolean) is deprecated and should not be used by core classes. One 
place where it appears is FieldsReader , this patch fixes it. It was necessary 
to change Fieldable to AbstractField at two places, only local variables.   


[jira] Updated: (LUCENE-1615) deprecated method used in fieldsReader / setOmitTf()

2009-04-26 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1615:


Attachment: LUCENE-1615.patch

 deprecated method used in fieldsReader / setOmitTf()
 

 Key: LUCENE-1615
 URL: https://issues.apache.org/jira/browse/LUCENE-1615
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Eks Dev
Priority: Trivial
 Attachments: LUCENE-1615.patch


 setOmitTf(boolean) is deprecated and should not be used by core classes. One 
 place where it appears is FieldsReader , this patch fixes it. It was 
 necessary to change Fieldable to AbstractField at two places, only local 
 variables.   


[jira] Commented: (LUCENE-1615) deprecated method used in fieldsReader / setOmitTf()

2009-04-26 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702901#action_12702901
 ] 

Eks Dev commented on LUCENE-1615:
-

sure, replacing Fieldable is good, I just noticed a quick win when cleaning up 
deprecations from our code base... one step at a time 

 deprecated method used in fieldsReader / setOmitTf()
 

 Key: LUCENE-1615
 URL: https://issues.apache.org/jira/browse/LUCENE-1615
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Eks Dev
Priority: Trivial
 Attachments: LUCENE-1615.patch


 setOmitTf(boolean) is deprecated and should not be used by core classes. One 
 place where it appears is FieldsReader , this patch fixes it. It was 
 necessary to change Fieldable to AbstractField at two places, only local 
 variables.   


[jira] Created: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-26 Thread Eks Dev (JIRA)

add one setter for start and end offset to OffsetAttribute
--

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial


add OffsetAttribute. setOffset(startOffset, endOffset);

trivial change, no JUnit needed

Changed CharTokenizer to use it


[jira] Updated: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-26 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1616:


Attachment: LUCENE-1616.patch

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Attachments: LUCENE-1616.patch



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-21 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12701279#action_12701279
]

Eks Dev commented on LUCENE-1606:
-

Robert,
in order for Levenshtein automata to work, you need to have the complete
dictionary as a DFA. Once you have the dictionary as a DFA (or any sort of
trie), computing simple regexes or simple fixed or weighted Levenshtein
distance becomes a snap. Levenshtein automata are particularly fast at it; a
much simpler and only slightly slower method (one page of code) is described
by K. Oflazer: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.136.3862

As I said, you cannot really walk the current term dictionary as an
automaton/trie (or do you have an idea of how to do that?). I guess there are
enough applications where storing the complete term dictionary in a RAM-based
DFA is not a problem. Even making some smart (heavily cached) persistent
trie/DFA should not be all that complex.

Or did you intend just to iterate over all terms and compute the distance
faster, breaking off the LD matrix computation as soon as you see you have hit
the boundary? But this still requires iterating over all terms?

I have done something similar, in memory, but unfortunately someone else paid
me for this and is not willing to share...
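
The "break off the LD matrix computation at the boundary" idea mentioned above can be sketched roughly as follows. This is a plain two-row Levenshtein with an early exit, written for illustration only; the class and method names are assumptions, and it is not the Levenshtein-automaton or Oflazer approach.

```java
// Illustrative only: bounded edit distance with early termination.
final class BoundedLevenshtein {
    private BoundedLevenshtein() {}

    /**
     * Returns the Levenshtein distance between a and b, or -1 as soon as it is
     * certain that the distance exceeds maxDist.
     */
    static int distance(String a, String b, int maxDist) {
        // The length difference is a lower bound on the distance.
        if (Math.abs(a.length() - b.length()) > maxDist) {
            return -1;
        }
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // The row minimum never decreases from one row to the next, so once
            // a whole row is above the bound the final value must be as well.
            if (rowMin > maxDist) {
                return -1;
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()] <= maxDist ? prev[b.length()] : -1;
    }
}
```

The early exit saves work per term, but every term still has to be visited, which is the concern raised in the comment.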

Automaton Query/Filter (scalable regex)
---

Key: LUCENE-1606
URL: https://issues.apache.org/jira/browse/LUCENE-1606
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Reporter: Robert Muir
Priority: Minor
Fix For: 2.9

Attachments: automaton.patch, automatonMultiQuery.patch,
automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch,
automatonWithWildCard.patch, automatonWithWildCard2.patch

Attached is a patch for an AutomatonQuery/Filter (the name can change if it's
not suitable).
Whereas the out-of-box contrib RegexQuery is nice, I have some very large
indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc.
Additionally all of the existing RegexQuery implementations in Lucene are
really slow if there is no constant prefix. This implementation does not
depend upon constant prefix, and runs the same query in 640ms.
Some use cases I envision:
1. lexicography/etc on large text corpora
2. looking for things such as urls where the prefix is not constant (http://
or ftp://)
The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert
regular expressions into a DFA. Then, the filter enumerates terms in a
special way, by using the underlying state machine. Here is my short
description from the comments:
The algorithm here is pretty basic. Enumerate terms but instead of a
binary accept/reject do:

1. Look at the portion that is OK (did not enter a reject state in the
DFA)
2. Generate the next possible String and seek to that.
the Query simply wraps the filter with ConstantScoreQuery.
I did not include the automaton.jar inside the patch but it can be downloaded
from http://www.brics.dk/automaton/ and is BSD-licensed.


[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-21 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12701298#action_12701298
]

Eks Dev commented on LUCENE-1606:
-

Hmmm, sounds like a good idea, but I am still not convinced it would work for
fuzzy queries.

Take a simple dictionary:
one
two
three
four

Say the query term is, e.g., ana, and n=1. That means your DFA would be:
{.na, a.a, an., an, na, aa, ana, .ana, a.na, an.a, ana.}, where the dot
represents any character in your alphabet.

For the first element of the DFA (in expanded form) you need to visit all
terms, no matter how you walk the DFA... or am I missing something?

Where you could save time is the actual calculation of the LD matrix for terms
that do not pass the automaton.
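
To make the expanded form above concrete, here is a small sketch that enumerates the distance-1 patterns for a term, using '.' as the "any character" wildcard. The class name FuzzyPatterns is made up for illustration, and a real implementation would build an automaton rather than a set of strings.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative only: enumerates the n=1 patterns instead of building a DFA.
final class FuzzyPatterns {
    private FuzzyPatterns() {}

    /** All patterns within edit distance 1 of term; '.' matches any character. */
    static Set<String> expand(String term) {
        Set<String> out = new LinkedHashSet<>();
        out.add(term); // distance 0
        for (int i = 0; i < term.length(); i++) {
            String head = term.substring(0, i);
            String tail = term.substring(i + 1);
            out.add(head + "." + tail); // substitution at position i
            out.add(head + tail);       // deletion of position i
        }
        for (int i = 0; i <= term.length(); i++) {
            // insertion of any character before position i
            out.add(term.substring(0, i) + "." + term.substring(i));
        }
        return out;
    }
}
```

Patterns such as ".na" and ".ana" start with a wildcard, so they share no constant prefix with the query term; that is why a prefix-ordered term dictionary gives no shortcut for them.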

Automaton Query/Filter (scalable regex)
---

Key: LUCENE-1606
URL: https://issues.apache.org/jira/browse/LUCENE-1606
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Reporter: Robert Muir
Priority: Minor
Fix For: 2.9

Attachments: automaton.patch, automatonMultiQuery.patch,
automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch,
automatonWithWildCard.patch, automatonWithWildCard2.patch


[jira] Commented: (LUCENE-1410) PFOR implementation

2009-03-23 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688284#action_12688284
 ] 

Eks Dev commented on LUCENE-1410:
-

It looks like Google went there as well (block encoding).

See the blog post at http://blogs.sun.com/searchguy/entry/google_s_postings_format
and http://research.google.com/people/jeff/WSDM09-keynote.pdf (slides 47-63).



 PFOR implementation
 ---

 Key: LUCENE-1410
 URL: https://issues.apache.org/jira/browse/LUCENE-1410
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Other
Reporter: Paul Elschot
Priority: Minor
 Attachments: autogen.tgz, LUCENE-1410b.patch, LUCENE-1410c.patch, 
 LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, 
 TestPFor2.java, TestPFor2.java

   Original Estimate: 21840h
  Remaining Estimate: 21840h

 Implementation of Patched Frame of Reference.


[jira] Commented: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs

2009-03-23 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688429#action_12688429
 ] 

Eks Dev commented on LUCENE-1561:
-

Maybe something along the lines of:

usePureBooleanPostings()
minimalInvertedList()




 Maybe rename Field.omitTf, and strengthen the javadocs
 --

 Key: LUCENE-1561
 URL: https://issues.apache.org/jira/browse/LUCENE-1561
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1561.patch


 Spinoff from here:
   
 http://www.nabble.com/search-problem-when-indexed-using-Field.setOmitTf()-td22456141.html
 Maybe rename omitTf to something like omitTermPositions, and make it clear 
 what queries will silently fail to work as a result.


[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication

2009-03-02 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12678121#action_12678121
]

Eks Dev commented on SOLR-1044:
---

I do not know much about Solr's needs there, but we are using one of the
prehistoric versions of Hadoop RPC (the non-NIO version), as everything else
proved to eat far too much time (in an 800+ requests/sec environment every
millisecond counts). Creating new Sockets does not work there, as OSs start
having problems keeping up with this rate (especially with Java, where Socket
release is slower due to gc() latency).

We are anyhow contemplating giving Etch (or Thrift) a try. Etch looks like a
really good piece of work, with great flexibility. Has anyone tried it?

Use Hadoop RPC for inter Solr communication
---

Key: SOLR-1044
URL: https://issues.apache.org/jira/browse/SOLR-1044
Project: Solr
Issue Type: New Feature
Components: search
Reporter: Noble Paul

Solr uses HTTP for distributed search. We can make it a whole lot faster if
we use an RPC mechanism which is more lightweight/efficient.
Hadoop RPC looks like a good candidate for this.
The implementation should just have one protocol. It should follow Solr's
idiom of making remote calls: a URI + params + [optional stream(s)]. The
response can be a stream of bytes.
To make this work we must make the SolrServer implementation pluggable in
distributed search. Users should be able to choose between the current
CommonsHttpSolrServer and a HadoopRpcSolrServer.


[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

2009-02-02 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12669595#action_12669595
 ] 

Eks Dev commented on LUCENE-1532:
-

bq. but I'm not sure the exact frequency number at just word-level is really
that useful for spelling correction, assuming a normal zipfian distribution.

You are probably right, you cannot expect high resolution from frequency, but
exact frequency information is your source information. Clustering it into
anything is just one algorithmic modification after which, in the end, less
information remains. Mark suggests 1-10, someone else would be happy with 1-3
... who could tell? Therefore I would recommend keeping the real frequency
information and leaving it to the end user to decide what to do with it.

Frequency distribution is not a simple measure; it depends heavily on corpus
composition and size. In one corpus a doc. frequency of 3 means it is probably
a typo, in another it means nothing...

My proposal is to work with the real frequency, as you have no information loss
there ...

 File based spellcheck with doc frequencies supplied
 ---

 Key: LUCENE-1532
 URL: https://issues.apache.org/jira/browse/LUCENE-1532
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/spellchecker
Reporter: David Bowen
Priority: Minor

 The file-based spellchecker treats all words in the dictionary as equally 
 valid, so it can suggest a very obscure word rather than a more common word 
 which is equally close to the misspelled word that was entered.  It would be 
 very useful to have the option of supplying an integer with each word which 
 indicates its commonness.  I.e. the integer could be the document frequency 
 in some index or set of indexes.
 I've implemented a modification to the spellcheck API to support this by 
 defining a DocFrequencyInfo interface for obtaining the doc frequency of a 
 word, and a class which implements the interface by looking up the frequency 
 in an index.  So Lucene users can provide alternative implementations of 
 DocFrequencyInfo.  I could submit this as a patch if there is interest.  
 Alternatively, it might be better to just extend the spellcheck API to have a 
 way to supply the frequencies when you create a PlainTextDictionary, but that 
 would mean storing the frequencies somewhere when building the spellcheck 
 index, and I'm not sure how best to do that.


[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

2009-01-30 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12669018#action_12669018
]

Eks Dev commented on LUCENE-1532:
-

bq. so it can suggest a very obscure word rather than a more common word which
is equally close to the misspelled word that was entered

In my experience frequency information helps a lot there, but the relationship
is not linear. It is not always the case that the word with the higher
frequency makes the better suggestion. Common sense says that high-frequency
words often get misspelled in different ways in a normal corpus, which produces
the following patterns:

An HF (high frequency) word against an LF (low frequency) word that is similar
in the edit distance sense is much more likely to be a typo/misspelling than
the HF vs HF case.

Similar cases with HF vs LF:
the vs hte
think vs tihnk

Very similar, but HF vs HF:
think vs thing

Some cases that fall outside these ideas are synonyms, alternative spellings
and very common mistakes. They are very tricky to isolate using just some
distance measure and frequency. Here you need context.
Similar and HF vs HF:
thomas vs tomas: sometimes a spelling mistake, sometimes different names...

It depends on what you are trying to achieve. If you expect mistakes in the
query, you are fine assuming that HF suggestions are better. But if you go for
high recall, you also need to cover cases where the query term is correct and
you have to dig into your corpus to find the incorrect words (the query "think
about it" should find a document containing "tihnk about it").

A very challenging problem, but cutting to the chase: the proposal is to make
it possible to define
float Function(Edit distance, Query_Token_Freq, Corpus_Token_Freq) that
returns some measure that is higher for more similar pairs, considering edit
distance and frequency (the value that gets used as the ordering condition for
the priority queue). The default could just work as you described. (Maybe it is
already possible; I did not look at it.)
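
One way the proposed pluggable function could be expressed is sketched below. The interface name SuggestionScorer and both formulas are assumptions chosen for illustration; neither is taken from the spellchecker code or from any patch on this issue.

```java
// Illustrative only: a pluggable scoring function for spelling suggestions.
final class SuggestionScoring {
    private SuggestionScoring() {}

    /** Hypothetical form of float Function(Edit distance, Query_Token_Freq, Corpus_Token_Freq). */
    @FunctionalInterface
    interface SuggestionScorer {
        float score(int editDistance, int queryTokenFreq, int corpusTokenFreq);
    }

    /**
     * Default behaviour: smaller edit distance always wins; among equally close
     * candidates the more frequent one scores higher (the bonus stays below 1).
     */
    static final SuggestionScorer DEFAULT =
        (editDistance, queryTokenFreq, corpusTokenFreq) ->
            (float) (-editDistance + corpusTokenFreq / (corpusTokenFreq + 1.0));

    /**
     * Assumed HF-vs-LF heuristic: a rare query token next to a frequent
     * candidate scores high, two frequent tokens score close to zero.
     */
    static final SuggestionScorer RATIO_AWARE =
        (editDistance, queryTokenFreq, corpusTokenFreq) ->
            (float) (Math.log((corpusTokenFreq + 1.0) / (queryTokenFreq + 1.0))
                     / (1 + editDistance));
}
```

With RATIO_AWARE, "tihnk" (rare) against "think" (frequent) outranks "think" against "thing" (both frequent), which matches the HF vs LF pattern described above; the returned value would be what orders the suggestion priority queue.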

File based spellcheck with doc frequencies supplied
---

Key: LUCENE-1532
URL: https://issues.apache.org/jira/browse/LUCENE-1532
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/spellchecker
Reporter: David Bowen


[jira] Commented: (LUCENE-1518) Merge Query and Filter classes

2009-01-12 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12663120#action_12663120
]

Eks Dev commented on LUCENE-1518:
-

Nice,
you did it top down (API), Paul takes it bottom up (speed).

This makes some really crazy things possible, e.g. implementing a normal
TermQuery as a DirectFilter. Once the optimization of BooleanQuery gets done
(no score calculation, direct usage of DocIdSetIterators), you can speed up
some queries containing TermQuery without really instantiating a Filter. Of
course, this only holds for cases where tf/idf/norm can be ignored.

It is a kind of middle ground between a Filter and a fully ranked TermQuery
(or, better said, any BooleanQuery!): faster than the ranked case because score
calculation is switched off, and more comfortable than using a Filter, with no
instantiation of DocIdSets...

Very nice indeed, a smooth mix between the ranked and pure boolean models with
the benefits of both.

Merge Query and Filter classes
--

Key: LUCENE-1518
URL: https://issues.apache.org/jira/browse/LUCENE-1518
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.4
Reporter: Uwe Schindler
Attachments: LUCENE-1518.patch


[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

2008-10-20 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641128#action_12641128
]

Eks Dev commented on LUCENE-1426:
-

Just a few random thoughts on this topic

- I am sure I read somewhere in those PDFs that were floating around that it
would make sense to use VInts for very short postings and PFOR for the rest. I
just do not remember the rationale behind it.

- During the omitTf() discussion, we came up with the cool idea of actually
inlining very short postings into the term dict instead of storing an offset.
This way we save one seek per term in many cases, as well as some space for
storing the offset. I do not know if this is a problem, but it sounds
reasonable. With a standard Zipfian distribution, a lot of postings should get
inlined. Use cases where we have query expansion on many terms (think spell
checker, synonyms ...) should benefit from that heavily. These postings are
small but there are a lot of them, so it adds up... seek is deadly :)
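
The inlining idea from the second point can be sketched roughly as follows: a term dictionary entry holds either the VInt-encoded doc deltas themselves or an offset into the postings file. The class names, the cutoff of 4 documents and the entry layout are all assumptions made for illustration; only the variable-length byte encoding follows the usual VInt scheme.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: inline very short postings instead of storing an offset.
final class InlinePostingsSketch {
    private InlinePostingsSketch() {}

    /** Hypothetical cutoff: postings with at most this many docs are inlined. */
    static final int INLINE_THRESHOLD = 4;

    /** Term dictionary entry: inlined bytes, or an offset into the postings file. */
    static final class TermEntry {
        final byte[] inlined; // null when the postings live in the postings file
        final long offset;    // -1 when the postings are inlined

        TermEntry(byte[] inlined, long offset) {
            this.inlined = inlined;
            this.offset = offset;
        }
    }

    /** Encodes ascending doc ids as VInt deltas (7 data bits per byte). */
    static byte[] encodeVInts(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int doc : sortedDocIds) {
            int delta = doc - prev;
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80); // high bit: more bytes follow
                delta >>>= 7;
            }
            out.write(delta);
            prev = doc;
        }
        return out.toByteArray();
    }

    /** Decodes count doc ids written by encodeVInts. */
    static int[] decodeVInts(byte[] bytes, int count) {
        int[] docs = new int[count];
        int pos = 0;
        int prev = 0;
        for (int i = 0; i < count; i++) {
            int value = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                value |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += value;
            docs[i] = prev;
        }
        return docs;
    }

    /** Short postings are inlined; longer ones keep the file offset (one seek). */
    static TermEntry entryFor(int[] sortedDocIds, long postingsFileOffset) {
        return sortedDocIds.length <= INLINE_THRESHOLD
            ? new TermEntry(encodeVInts(sortedDocIds), -1L)
            : new TermEntry(null, postingsFileOffset);
    }
}
```

For an inlined entry the reader never touches the postings file, which is where the saved seek per term would come from.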

I am sorry to miss the party here with PFOR, but let us hope this credit crunch
is over soon so that I can dedicate some time to fun things like this :)

cheers, eks

Next steps towards flexible indexing

Key: LUCENE-1426
URL: https://issues.apache.org/jira/browse/LUCENE-1426
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9

Attachments: LUCENE-1426.patch

In working on LUCENE-1410 (PFOR compression) I tried to prototype
switching the postings files to use PFOR instead of vInts for
encoding.
But it quickly became difficult. EG we currently mux the skip data
into the .frq file, which messes up the int blocks. We inline
payloads with positions which would also mess up the int blocks.
Skipping offsets and TermInfo offsets hardwire the file pointers of
frq prox files yet I need to change these to block + offset, etc.
Separately this thread also started up, on how to customize how Lucene
stores positional information in the index:
http://www.gossamer-threads.com/lists/lucene/java-user/66264
So I decided to make a bit more progress towards flexible indexing
by first modularizing/isolating the classes that actually write the
index format. The idea is to capture the logic of each (terms, freq,
positions/payloads) into separate interfaces and switch the flushing
of a new segment as well as writing the segment during merging to use
the same APIs.


[jira] Commented: (LUCENE-1329) Remove synchronization in SegmentReader.isDeleted

2008-08-22 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12624634#action_12624634
 ] 

Eks Dev commented on LUCENE-1329:
-

Mike, has anyone measured what this brings?

This practically reduces the need to have many IndexReaders in a multi-threaded
setup when the index is used read-only.


 Remove synchronization in SegmentReader.isDeleted
 -

 Key: LUCENE-1329
 URL: https://issues.apache.org/jira/browse/LUCENE-1329
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.3.1
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 2.4

 Attachments: LUCENE-1329.patch, LUCENE-1329.patch, lucene-1329.patch


 Removes SegmentReader.isDeleted synchronization by using a volatile 
 deletedDocs variable on Java 1.5 platforms.  On Java 1.4 platforms 
 synchronization is limited to obtaining the deletedDocs reference.
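
A minimal sketch of the volatile-reference pattern described above, using java.util.BitSet as a stand-in for the deleted-docs structure. The class name and the copy-on-write update are assumptions added to keep the example thread-safe on its own; the actual patch may handle writes differently.

```java
import java.util.BitSet;

// Illustrative only: lock-free reads through a volatile reference.
final class DeletedDocsSketch {
    // volatile: readers see the most recently published BitSet without locking.
    private volatile BitSet deletedDocs;

    /** Read path: a single volatile read, no synchronized block. */
    boolean isDeleted(int doc) {
        BitSet snapshot = deletedDocs;
        return snapshot != null && snapshot.get(doc);
    }

    /**
     * Write path (assumed copy-on-write): writers still synchronize among
     * themselves and publish a fresh copy, so readers never see a BitSet
     * that is being modified.
     */
    synchronized void delete(int doc) {
        BitSet next = deletedDocs == null ? new BitSet() : (BitSet) deletedDocs.clone();
        next.set(doc);
        deletedDocs = next;
    }
}
```

With this shape, many search threads can share one reader and call isDeleted without contending on a monitor.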


[jira] Commented: (LUCENE-1329) Remove synchronization in SegmentReader.isDeleted

2008-08-22 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12624657#action_12624657
 ] 

Eks Dev commented on LUCENE-1329:
-

OK, I see, thanks.
At least it resolves the issue completely for RAM-based indexes.

We have seen a performance drop for a RAM-based index when we switched to a
multi-threaded setup with a shared IndexReader. I am not yet sure what the
reason for it is: problems in our code, or whether this is indeed related to
Lucene. I am talking about a 25-30% drop with 3 threads on a 4-core CPU. Must
measure it properly...



 Remove synchronization in SegmentReader.isDeleted
 -

 Key: LUCENE-1329
 URL: https://issues.apache.org/jira/browse/LUCENE-1329
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.3.1
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 2.4

 Attachments: LUCENE-1329.patch, LUCENE-1329.patch, lucene-1329.patch



[jira] Commented: (LUCENE-1219) support array/offset/ length setters for Field with binary data

2008-08-19 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12623593#action_12623593
 ] 

Eks Dev commented on LUCENE-1219:
-

bq. did you ever measure the before/after performance difference?

Sure we did. It has been a while since we measured it, so I do not have the
real numbers at hand. But for both cases (indexing and fetching a stored binary
field) it showed up during profiling as the only easy quick win we could make.

We index very short documents, and indexing speed per thread before this patch
was in the 7.5k documents/second range; with the patch we run at over
9.5-10k documents/second, sweet...

For searching, I do not remember the numbers, but it was surely above the 5%
range (try to allocate 12Mb in 6k objects per second as an unnecessary addition
and you will see it :)



 support array/offset/ length setters for Field with binary data
 ---

 Key: LUCENE-1219
 URL: https://issues.apache.org/jira/browse/LUCENE-1219
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Eks Dev
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.4

 Attachments: LUCENE-1219.extended.patch, LUCENE-1219.extended.patch, 
 LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, 
 LUCENE-1219.take2.patch, LUCENE-1219.take3.patch


 Currently the Field/Fieldable interface supports only compact, zero-based byte
 arrays. This forces end users to create new objects and copy content into
 them before passing them to Lucene, as such fields are often of variable size.
 Depending on the use case, avoiding this can bring a far from negligible
 performance improvement.
 This approach extends the Fieldable interface with 3 new methods:
 getOffset(), getLength() and getBinaryValue() (which only returns a reference
 to the array).
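
The (array, offset, length) contract described above can be sketched as below. The class name BinarySliceSketch and the setter are hypothetical stand-ins; only the three getter names come from the issue text, and the real Fieldable signatures may differ.

```java
// Illustrative only: a binary value exposed as array + offset + length.
final class BinarySliceSketch {
    private byte[] value;
    private int offset;
    private int length;

    /** Points at a region of a caller-owned buffer; nothing is copied. */
    void setValue(byte[] value, int offset, int length) {
        if (offset < 0 || length < 0 || offset > value.length - length) {
            throw new IllegalArgumentException("slice out of bounds");
        }
        this.value = value;
        this.offset = offset;
        this.length = length;
    }

    /** Returns a reference to the backing array, not a copy. */
    byte[] getBinaryValue() {
        return value;
    }

    int getOffset() {
        return offset;
    }

    int getLength() {
        return length;
    }
}
```

A caller can keep one large reusable buffer, fill a region of it for each document, and hand over only (buffer, offset, length) instead of allocating a right-sized array every time.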



[jira] Commented: (LUCENE-1219) support array/offset/ length setters for Field with binary data

2008-08-18 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12623332#action_12623332
 ] 

Eks Dev commented on LUCENE-1219:
-

how was it: repetitio est mater studiorum ;)

thanks Mike! 





 support array/offset/ length setters for Field with binary data
 ---

 Key: LUCENE-1219
 URL: https://issues.apache.org/jira/browse/LUCENE-1219
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Eks Dev
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.4

 Attachments: LUCENE-1219.extended.patch, LUCENE-1219.extended.patch, 
 LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, 
 LUCENE-1219.take2.patch, LUCENE-1219.take3.patch



[jira] Updated: (LUCENE-1219) support array/offset/ length setters for Field with binary data

2008-08-09 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1219:


Attachment: LUCENE-1219.extended.patch

bq. couldn't you just call document.getFieldable(name), and then call 
binaryValue(byte[] result) on that Fieldable, and then get the length from it 
(getBinaryLength()) too? (Trying to minimize API changes).

Sure, good tip, I think this could work. No need to have this
byte[]-Fieldable-byte[] loop, it is confusing. I have attached a patch that
uses this approach. But I created getBinaryValue(byte[]) instead of
binaryValue(byte[]), as we already have binaryValue() as a deprecated method
(it would be confusing as well). Not really tested, but it looks simple enough.

Just thinking aloud:
This is one nice feature, but I have permanently had the feeling that I do not
understand these Field structures, roles and responsibilities :)
The Field/Fieldable/AbstractField hierarchy is really ripe for a good
refactoring. This bigamy of index and search use cases makes things not really
easy to follow. Hoss is right, we need some way to divorce RetrievedField from
FieldToBeIndexed; they are definitely not the same, just very similar.

 support array/offset/ length setters for Field with binary data
 ---

 Key: LUCENE-1219
 URL: https://issues.apache.org/jira/browse/LUCENE-1219
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Eks Dev
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1219.extended.patch, LUCENE-1219.extended.patch, 
 LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, 
 LUCENE-1219.take2.patch, LUCENE-1219.take3.patch



[jira] Updated: (LUCENE-1219) support array/offset/ length setters for Field with binary data

2008-08-08 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1219:


Attachment: LUCENE-1219.extended.patch

Mike,
This new patch includes take3 and adds the following:

Fieldable  Document.getStoredBinaryField(String name, byte[] scratch);

where the scratch param represents a user byte buffer that will be used if it
is big enough; if not, a new array will simply be allocated like today. If
scratch is used, you get the same object back through Fieldable.getByteValue().

For this to work, I added one new method to Fieldable:
abstract Fieldable getBinaryField(byte[] scratch);

The only interesting implementation is in LazyField.

The reason for this is in my previous comment.

This does not affect the issues from take3 at all, but is dependent on it, as
you need to know the length of the byte[] you read. take3 remains good to
commit; I just did not know how to make one isolated patch with only these
changes without too much work in a text editor.
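
The scratch-buffer behaviour described above can be sketched roughly like this. The class, the BinaryResult holder and the read method are made-up names for illustration, and the stored array stands in for the bytes that LazyField would read from the index; this is not the code from the patch.

```java
// Illustrative only: reuse a caller-supplied buffer when it is large enough.
final class ScratchReadSketch {
    private ScratchReadSketch() {}

    /** The array holding the data plus the number of valid bytes in it. */
    static final class BinaryResult {
        final byte[] bytes;
        final int length;

        BinaryResult(byte[] bytes, int length) {
            this.bytes = bytes;
            this.length = length;
        }
    }

    /**
     * Copies the stored value into scratch if it fits, otherwise into a newly
     * allocated array. The valid length is returned separately because a
     * reused scratch buffer is usually longer than the value it holds.
     */
    static BinaryResult read(byte[] stored, byte[] scratch) {
        byte[] target = (scratch != null && scratch.length >= stored.length)
            ? scratch
            : new byte[stored.length];
        System.arraycopy(stored, 0, target, 0, stored.length);
        return new BinaryResult(target, stored.length);
    }
}
```

This is why the change depends on the length accessor from take3: once the returned array may be an oversized scratch buffer, its length alone no longer tells the caller how many bytes were read.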
 

 support array/offset/ length setters for Field with binary data
 ---

 Key: LUCENE-1219
 URL: https://issues.apache.org/jira/browse/LUCENE-1219
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Eks Dev
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1219.extended.patch, LUCENE-1219.patch, 
 LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, 
 LUCENE-1219.take2.patch, LUCENE-1219.take3.patch



[jira] Commented: (LUCENE-1219) support array/offset/ length setters for Field with binary data

2008-08-08 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12621036#action_12621036
 ] 

Eks Dev commented on LUCENE-1219:
-

bq. could we instead add this to Field:
byte[] binaryValue(byte[] result)

this is exactly where I started, but then I realized I am missing the actual 
length we read in LazyField. Without it you would have to reallocate each time, 
except in the case where your buffer length equals toRead in LazyField... 
Simply, the question is: how could the caller of byte[] getBinaryValue(String 
name, byte[] result) know the length of the valid data in the returned byte[]?

Am I missing something obvious?
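
To make the length problem concrete, here is a minimal sketch. When the reader 
fills a reused buffer that is larger than the stored value, the returned array 
alone does not say how many bytes are valid. BinaryResult and readInto are 
hypothetical names for illustration only, they are not Lucene API, and the 
copy stands in for what a LazyField-like reader would do.

```java
import java.util.Arrays;

// Hypothetical holder: a reusable buffer plus the number of valid bytes in it.
final class BinaryResult {
    byte[] bytes = new byte[0];
    int length; // valid bytes in 'bytes'; may be smaller than bytes.length

    // Copy 'stored' into the holder, growing the buffer only when it is too small.
    // If dest is null a new holder is created, otherwise dest is reused and returned.
    static BinaryResult readInto(byte[] stored, BinaryResult dest) {
        BinaryResult result = (dest == null) ? new BinaryResult() : dest;
        if (result.bytes.length < stored.length) {
            result.bytes = new byte[stored.length]; // reallocate only on growth
        }
        System.arraycopy(stored, 0, result.bytes, 0, stored.length);
        result.length = stored.length;
        return result;
    }

    // Valid bytes as a compact copy (for inspection only; this allocates).
    byte[] toCompactArray() {
        return Arrays.copyOf(bytes, length);
    }
}
```

With a bare byte[] return value the caller would see a 4-byte array after 
reading a 2-byte value into a reused 4-byte buffer; the separate length field 
is what makes the reuse safe.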


[jira] Commented: (LUCENE-1219) support array/offset/ length setters for Field with binary data

2008-08-05 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12620019#action_12620019
]

Eks Dev commented on LUCENE-1219:
-

Great Mike,
it gets better and better, I saw LUCENE-1340 committed. Thanks to you, Grant,
Doug and all others that voted for 1349, this happened so quickly. Trust me,
these two issues are really making my life easier. I pushed the decision to add
new hardware to some future point (means, save customer's money now)... a few
weeks later would have been too late.

Now it remains only to make one nice patch that enables us to pass our own
byte[] for retrieving stored fields during search. I was thinking along the
lines of things you did in Analyzers.

we could pull the same trick for this, e.g.

Field Document.getBinaryValue(String FIELD_NAME, Field destination);

Field already has all access methods (get/set),

the contract would be: if destination==null, a new one will be created and
returned; if not, we use this one and return the same object back. The method
should check if byte[] is big enough; if not, a simple growth policy can be
applied. This way we avoid a new byte[] each time you fetch a stored field.

I did not look exactly at the code now, but the last time I was looking into it
it looked quite simple to do something along these lines. Do you have some
ideas how we could do it better?

Just a simple calculation for my case:
average Hits count is around 200, for each hit we have to fetch one stored
field where we do some post-processing, re-scoring and whatnot. Currently we
run max 30 rq/second; with an average document length of 2k you land at
200 * 30 = 6000 object allocations per second, totaling 6000 * 2k = 12MB per
second... only to get the data... I can imagine people with much longer
documents (that would be typical lucene use case) where it gets worse... simply
reducing gc() pressure with a really small amount of work. I am sure this would
have nice effects on some other use cases in lucene.
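
The estimate above can be reproduced as a small calculation. The input numbers 
come from the comment; the class and method names are illustrative, and 
treating 2k as 2000 bytes is an assumption.

```java
// Back-of-the-envelope allocation estimate; names are illustrative only.
final class AllocationEstimate {
    // Stored-field fetches (one new byte[] each) per second.
    static long allocationsPerSecond(int hitsPerQuery, int queriesPerSecond) {
        return (long) hitsPerQuery * queriesPerSecond;
    }

    // Bytes allocated per second for those fetches.
    static long bytesPerSecond(int hitsPerQuery, int queriesPerSecond, int avgDocBytes) {
        return allocationsPerSecond(hitsPerQuery, queriesPerSecond) * avgDocBytes;
    }
}
```

For 200 hits, 30 queries per second and 2000-byte documents this gives 6000 
allocations and 12,000,000 bytes per second.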

thanks again to all workers behind this great piece of software...
eks

PS: I need to find some time to peek at Paul's work in LUCENE-1345 and my
wish list will be complete, at least for now (at least until you get your magic
with flexi index format done :)


[jira] Commented: (LUCENE-1345) Allow Filter as clause to BooleanQuery

2008-07-29 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12617726#action_12617726
 ] 

Eks Dev commented on LUCENE-1345:
-

Yonik, this would probably work fine for int values (on my CPU), I have tried 
it on long values and this was significantly slower on this test... it boils 
down again to what is the CPU we are optimizing for :)

 Allow Filter as clause to BooleanQuery
 --

 Key: LUCENE-1345
 URL: https://issues.apache.org/jira/browse/LUCENE-1345
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Paul Elschot
Priority: Minor
 Attachments: DisjunctionDISI.patch, DisjunctionDISI.patch, 
 LUCENE-1345.patch, LUCENE-1345.patch, TestIteratorPerf.java





[jira] Commented: (LUCENE-1345) Allow Filter as clause to BooleanQuery

2008-07-29 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12617836#action_12617836
 ] 

Eks Dev commented on LUCENE-1345:
-

bq. comparison with -1 is being optimized away entirely 

I do not think so, how could the compiler optimize away the only condition that 
stops the loop? The loop would never finish, or am I misreading something here? 

Anyhow, the test is so simple that the compiler can take a completely different 
direction than in the real case. I guess a much better test (without too much 
effort!) would be to take something like OpenBitSetIterator, make one Iterator 
implementation with the sentinel approach and then compare... this test is 
really just a dumb loop, but on the other hand it isolates the difference 
between the two approaches...
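
For readers unfamiliar with the two styles being compared, here is a minimal 
sketch over a sorted int array. It is not the code from TestIteratorPerf.java; 
the class names are made up for illustration, and -1 is used as the end marker 
only because the thread talks about comparing against -1.

```java
// Two iteration contracts over the same sorted doc ids (illustrative only).
final class DocIterators {
    static final int NO_MORE_DOCS = -1; // assumed sentinel; doc ids are >= 0

    // Boolean style: next() says whether a doc exists, doc() returns it.
    static final class BooleanStyle {
        private final int[] docs;
        private int pos = -1; // positioned before the first doc
        BooleanStyle(int[] docs) { this.docs = docs; }
        boolean next() { return ++pos < docs.length; }
        int doc() { return docs[pos]; }
    }

    // Sentinel style: next() returns the doc itself, or NO_MORE_DOCS at the end.
    static final class SentinelStyle {
        private final int[] docs;
        private int pos = -1;
        SentinelStyle(int[] docs) { this.docs = docs; }
        int next() { return ++pos < docs.length ? docs[pos] : NO_MORE_DOCS; }
    }

    static long sumBooleanStyle(int[] docs) {
        BooleanStyle it = new BooleanStyle(docs);
        long sum = 0;
        while (it.next()) {      // one call to advance...
            sum += it.doc();     // ...and a second call to read
        }
        return sum;
    }

    static long sumSentinelStyle(int[] docs) {
        SentinelStyle it = new SentinelStyle(docs);
        long sum = 0;
        for (int doc = it.next(); doc != NO_MORE_DOCS; doc = it.next()) {
            sum += doc;          // a single call both advances and reads
        }
        return sum;
    }
}
```

Both loops visit the same documents; the comparison against the sentinel is 
the only exit condition of the second loop, which is the point made above.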




[jira] Updated: (LUCENE-1345) Allow Filter as clause to BooleanQuery

2008-07-29 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1345:


Attachment: OpenBitSetIteratorExperiment.java
TestIteratorPerf.java

I just enhanced TestIteratorPerf to work with 
OpenBitSetIterator(Experiment)... on dense bit sets the sentinel based iterators 
are faster (ca 9%), on low density about the same?

Yonik's tip -1 < doc instead of -1 != doc still performs worse, and knowing 
Yonik's hunch on these things, I am still not convinced it is really faster ...

Paul's work here is more interesting, clear API and Performance win on many 
fronts... 

practically, no need to pollute this issue more with iterator semantics; if I 
(or someone else) figure out something really interesting there, I will create 
a new issue 

 Allow Filter as clause to BooleanQuery
 --

 Key: LUCENE-1345
 URL: https://issues.apache.org/jira/browse/LUCENE-1345
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Paul Elschot
Priority: Minor
 Attachments: DisjunctionDISI.java, DisjunctionDISI.patch, 
 DisjunctionDISI.patch, LUCENE-1345.patch, LUCENE-1345.patch, 
 OpenBitSetIteratorExperiment.java, TestIteratorPerf.java, 
 TestIteratorPerf.java





[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

2008-07-29 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12617978#action_12617978
 ] 

Eks Dev commented on LUCENE-1340:
-

ouch! it is kind of getting personal between me and Fieldable :) Not the first 
time to get bugged by it!

Due to Fieldable (things really important, at least to me):  
- We cannot get binary stored Field in and out of lucene without making gc() 
go crazy
- We cannot omitTF 
 
it would be possible somehow to make it at AbstractField level and instanceof 
at a few places, but I simply hate to do it (I will patch my local copy, this 
issue is worth it to me... must branch off from the trunk for the first time, 
sigh)

funny it is, I see no reason to have anything but AbstractField 
(Field/Fieldable are just redundant)

 Make it posible not to include TF information in index
 --

 Key: LUCENE-1340
 URL: https://issues.apache.org/jira/browse/LUCENE-1340
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Eks Dev
Priority: Minor
 Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, 
 LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 Term Frequency is typically not needed  for all fields, some CPU (reading one 
 VInt less and one X1...) and IO can be spared by making pure boolean fields 
 possible in Lucene. This topic has already been discussed and accepted as a 
 part of Flexible Indexing... This issue tries to push things a bit faster 
 forward as I have some concrete customer demands.
 benefits can be expected for fields that are typical candidates for Filters, 
 enumerations, user rights, IDs or very short texts, phone  numbers, zip 
 codes, names...
 Status: just passed standard test (compatibility), committed for early review, 
 I have not tried new feature, missing some asserts and one or two unit tests
 Complexity: simpler than expected
 can be used via omitTf() (who used omitNorms() will know where to find it :)  


[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

2008-07-29 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12618069#action_12618069
]

Eks Dev commented on LUCENE-1340:
-

that sounds like consensus :) Great!

in that case LUCENE-1219 can be reworked slightly to avoid instanceof (less
code). Also it opens a way to pass reference to byte[] for retrieving stored
fields out of lucene and communicating length back to caller (now we allocate a
new byte[] every time we fetch a stored field)

bq. it's one of my biggest regrets in Lucene (yes, I am responsible for it),
yet I firmly believe there is a way to do interfaces and abstracts in a proper
way in Java.

no need to regret, Grant, if you do nothing you make no mistakes... Interfaces
are ok, as long as you can tell what they are going to be doing in the next 5
years... this forces you to design for the future... something we cannot
afford in such popular and complex libraries as lucene at places like Field.
Abstract* is an equally good design abstraction...

Proposal:
We could live with a statement that Fieldable changes are allowed from now on;
it is deprecated and will probably be removed in 3.0. It causes just a tiny bit
of work in case someone is really implementing it (adding new methods to
Fieldable like omitTf() costs you max 5 minutes of work to change your
implementing class to implement it!).

from 3.0 on, I could very well live without it; until then, we cause 5 minutes
of work for people that implement Fieldable on their own and want to stay up to
date with the trunk. It is a fair deal for everyone and lucene moves forward...


[jira] Updated: (LUCENE-1345) Allow Filter as clause to BooleanQuery

2008-07-28 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1345:


Attachment: TestIteratorPerf.java

Hi Paul, 
I gave it a try on micro benchmarking, and it looks like we could gain a lot by 
switching to the sentinel approach for iterators; apart from being faster they 
are also a bit more robust against off-by-one bugs. 

This test is just a simulation made assuming docId is long (I have tried it 
with int and it is about the same result).

Just attaching it here as I did not want to create new issue for now, before we 
identify if there are some design/performance knock-out criteria.

test on my setup:
32bit java version 1.6.0_10-rc
java(TM) SE Runtime Environment (build 1.6.0_10-rc-b28)
Windows XP Professional 32bit
notebook, 3GB RAM, 
CPU x86 Family 6 Model 15 Stepping 11 GenuineIntel ~2194 MHz

java -server -Xbatch


result (with docID long):
old  milliseconds=6938
old  milliseconds=6953
old  milliseconds=6890
old  milliseconds=6938
old  milliseconds=6906
old  milliseconds=6922
old  milliseconds=6906
old  milliseconds=6938
old  milliseconds=6906
old  milliseconds=6906
old total milliseconds=69203

new  milliseconds=5797
new  milliseconds=5703
new  milliseconds=5266
new  milliseconds=5250
new  milliseconds=5234
new  milliseconds=5250
new  milliseconds=5235
new  milliseconds=5250
new  milliseconds=5250
new  milliseconds=5250
new total milliseconds=53485
New/Old Time 53485/69203 (77.28711%)

all in all, more than 22% faster!!

Of course, this type of benchmark does not mean all iterator ops in real life 
are going to be 20% faster... other things probably dominate, but if it turns 
out that this test does not have some flaws (easily possible)... well worth 
pursuing

cheers, eks 





[jira] Commented: (LUCENE-1345) Allow Filter as clause to BooleanQuery

2008-07-28 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12617603#action_12617603
 ] 

Eks Dev commented on LUCENE-1345:
-

great! Will look into it at the weekend in more detail.

 I have moved this part to the constructor on my local copy, it passes all tests:

+if (disiDocQueue == null) {
+  initDisiDocQueue();
+}


it was in next() and skipTo()

practically the same as reported in 
https://issues.apache.org/jira/browse/LUCENE-1145, with this, 1145 can be closed
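
The change described above, sketched in isolation: the queue is built once in 
the constructor, so next() no longer needs a null check on every call. The 
class below is a simplified stand-in over sorted int arrays, not the 
DisjunctionDISI patch itself, and java.util.PriorityQueue is used here in 
place of the patch's own queue type.

```java
import java.util.PriorityQueue;

// Simplified disjunction over sorted int arrays (illustrative stand-in).
final class EagerDisjunction {
    // Cursor over one sorted sub-list.
    private static final class Cursor {
        final int[] docs;
        int pos;
        Cursor(int[] docs) { this.docs = docs; }
        int doc() { return docs[pos]; }
    }

    private final PriorityQueue<Cursor> queue;
    private int currentDoc = -1; // before the first doc; doc ids are >= 0

    EagerDisjunction(int[][] subLists) {
        // Initialized here instead of lazily inside next()/skipTo().
        queue = new PriorityQueue<>((a, b) -> Integer.compare(a.doc(), b.doc()));
        for (int[] docs : subLists) {
            if (docs.length > 0) {
                queue.add(new Cursor(docs));
            }
        }
    }

    // Advances to the next distinct doc id; returns false when exhausted.
    boolean next() {
        while (!queue.isEmpty()) {
            Cursor top = queue.poll();
            int doc = top.doc();
            if (++top.pos < top.docs.length) {
                queue.add(top); // re-insert with its new current doc
            }
            if (doc != currentDoc) { // skip ids already reported by another sub-list
                currentDoc = doc;
                return true;
            }
        }
        return false;
    }

    int doc() { return currentDoc; }
}
```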




[jira] Updated: (LUCENE-1345) Allow Filter as clause to BooleanQuery

2008-07-27 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1345:


Attachment: DisjunctionDISI.patch

I just realised TestDisjunctionDISI had a bug (iterators have to be 
reinitialized)... 

apart from that only small change in DISIQueue to use constants instead of vars 
(compiler should have done it as well, but you never know) 
 
 private final void downHeap() {
+int i = 1;
+int j = 2; // i << 1; // find smaller child
+int k = 3; //j + 1; 
+
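
For context, a minimal sketch of downHeap on a 1-based binary min-heap: the 
root sits at index 1 and its children at 2 and 3, which is why the literals 
can stand in for i << 1 and j + 1 on the first step. This is an illustrative 
int heap written for this note, not the DISIQueue code from the patch.

```java
// 1-based binary min-heap over ints (slot 0 unused), illustrative only.
final class IntMinHeap {
    private final int[] heap;
    private int size;

    IntMinHeap(int capacity) { heap = new int[capacity + 1]; }

    void add(int value) {
        heap[++size] = value;
        int i = size;
        while (i > 1 && heap[i >>> 1] > heap[i]) { // sift up
            int parent = i >>> 1;
            int tmp = heap[parent];
            heap[parent] = heap[i];
            heap[i] = tmp;
            i = parent;
        }
    }

    int top() { return heap[1]; }

    // Replace the root and restore heap order.
    void replaceTop(int value) {
        heap[1] = value;
        downHeap();
    }

    private void downHeap() {
        int i = 1;
        int node = heap[i]; // save the root
        int j = 2;          // i << 1: left child of the root
        int k = 3;          // j + 1: right child of the root
        if (k <= size && heap[k] < heap[j]) {
            j = k;          // pick the smaller child
        }
        while (j <= size && heap[j] < node) {
            heap[i] = heap[j]; // shift the child up
            i = j;
            j = i << 1;
            k = j + 1;
            if (k <= size && heap[k] < heap[j]) {
                j = k;
            }
        }
        heap[i] = node;     // install the saved node
    }
}
```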


[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

2008-07-26 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12617140#action_12617140
]

Eks Dev commented on LUCENE-1340:
-

we finished our tests

Index without omitTf() :
- 87Mio Documents, 2 indexed Fields one stored field
- Unique terms in index 2.5Mio
- Average Field lengths in tokens: 3.3 and 5.5 (very short fields)
- On Disk size 3.8 Gb total with stored field

Queries under test:
- BooleanQuery in all shapes and forms (disjunctive, conjunctive, nested, with
minNumberShouldMatch()) . with a lot of clauses (5-100).
- Filter used, yes

Test scope, regression with 30k Queries on the same index with
omitTf(true/false).

Result:

- The Queries returned 100% identical Hits (full recall tested, all hits
checked)!

- Index size reduction(not including stored field!): 7% (short documents =
less positions than in Mike's case)

- Performance of Queries: 5.2% faster, but index was loaded as RAMIndex (on
disk setup should bring even more due to the reduced IO for reading postings)

- Indexing performance (FSDisk!): 13% faster

Also, we compared omitTf(false) with this patch and lucene.jar without this
patch, no changes whatsoever.

From my perspective, this is good to go into production. At least for our
usage of lucene, there are no differences with omitTf(true)...

One more thing here: since the tiis are loaded into RAM, that unused
proxPointer wastes 8 bytes for each indexed term. For indices with a lot of
terms this can add up to a lot of wasted RAM. But still I think we should wait
and fix this as part of flexible indexing, when we maybe refactor the
TermInfos to be column stride instead.

I am more than happy with the results, no need to squeeze the last bit out of
it right now.

Mike, thanks again for the great work!


[jira] Updated: (LUCENE-1345) Allow Filter as clause to BooleanQuery

2008-07-26 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1345:


Attachment: DisjunctionDISI.patch

bq. Would anyone have a DisjunctionDISI (Disjunction over DocIdSetIterators) 
somewhere?

I have played with DisjunctionSumScorer rip-off, maybe you find it useful for 
this issue...

What would be nice here(and in DisjunctionSumScorer ), if possible?:

- to remove initDISIQueue() from next() and skipTo() (also the same in 
DisjunctionSumScorer()) ... this is due to this ugly  -1 position before first 
call, I just do not know how to get rid of it :)

- to switch to Conjunction mode if minNrShouldMatch kicks in; there are 
already todo-s for it around 
  

if you think you can use it, just go ahead and include it in your patch, I am 
not using this for anything, just wrapped it up when you asked. 


[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

2008-07-21 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12615357#action_12615357
 ] 

Eks Dev commented on LUCENE-1340:
-

Great, it is already more than I expected, even indexing is going to be 
somewhat faster.

I have tried your patch on a smallish index with 8Mio documents and it worked 
on our regression test without problems. 
It worked fine with and without omitTf(true), no performance drop or bad 
surprises when we do not use it. A real test with production data is scheduled 
for tomorrow, around 80Mio very small documents, with some very extensive 
tests; I will report back.

The one place I know of that will still waste bytes is the term dict
(TermInfo): it stores a long proxPointer on disk (in .tii,.tis) and
also in memory because we load *.tii into RAM 

 About this one, it would be nice not to store this as well, but I think the 
pointers are already reduced to one byte, as they are 0 for these cases (are 
they?). So we have this benefit without expecting it :)

And yes, more column stride is great. If you followed my comments on 
LUCENE-1278, that would mean we could easily inline very short postings into 
the term dict (here I expect a huge performance benefit, as a skip() on another 
large file is going to be saved, independent of omitTf(true)), without an 
increase in size (or a minimal one) of tii (no locality penalty). If we follow 
a Zipfian distribution, there are *a lot* of terms with postings shorter than 
e.g. 16 ... 

Thanks again for your support, without you this patch would be just another 
nice idea :)









[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

2008-07-20 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12615077#action_12615077
 ] 

Eks Dev commented on LUCENE-1278:
-

in light of Mike's comments here (Michael McCandless - 05/May/08 05:33 AM), I 
think it is worth mentioning that I am working on LUCENE-1340, that is storing 
postings without additional frq info. 

correct me if I am wrong, the only difference is that this approach with *.frq 
needs one seek more... at the same time, this could potentially increase term 
dict size, so we lose some locality.

Your last proposal sounds interesting, inline short postings into term 
dict, so for short postings (about the size of offset pointer into *.frq) with 
tf==1 (that is always the case if you use omitTf(true) from LUCENE-1340)  
we spare one seek()... this could be a lot. Also, there is no need to store 
postings into *.frq (this complicates maintenance I guess)  

 Add optional storing of document numbers in term dictionary
 ---

 Key: LUCENE-1278
 URL: https://issues.apache.org/jira/browse/LUCENE-1278
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.3.1
Reporter: Jason Rutherglen
Priority: Minor
 Attachments: lucene.1278.5.4.2008.patch, 
 lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch, 
 lucene.1278.5.7.2008.patch, lucene.1278.5.7.2008.test.patch, 
 TestTermEnumDocs.java


 Add optional storing of document numbers in term dictionary.  String index 
 field cache and range filter creation will be faster.  
 Example read code:
 {noformat}
 TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
 do {
   Term term = termEnum.term();
   if (term == null || term.field() != field) break;
   int[] docs = termEnum.docs();
 } while (termEnum.next());
 {noformat}
 Example write code:
 {noformat}
 Document document = new Document();
 document.add(new Field("tag", "dog", Field.Store.YES, 
 Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
 indexWriter.addDocument(document);
 {noformat}


[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index

2008-07-20 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1340:


Attachment: LUCENE-1340.patch

- fixed stupid bug in SegmentTermDocs (was doc = docCode; instead of doc += 
docCode;)
- TestOmitTf extended a bit 
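
To make the fix concrete: doc ids in a postings list are stored as gaps from 
the previous id, so the reader has to accumulate them. The sketch below uses 
plain int arrays and made-up names; it is not the SegmentTermDocs code and it 
leaves out the VInt and frequency-bit details of the real format.

```java
// Delta (gap) encoding of sorted doc ids, illustrative only.
final class DocDeltas {
    // Encode sorted doc ids as gaps from the previous id.
    static int[] encode(int[] sortedDocs) {
        int[] deltas = new int[sortedDocs.length];
        int prev = 0;
        for (int i = 0; i < sortedDocs.length; i++) {
            deltas[i] = sortedDocs[i] - prev;
            prev = sortedDocs[i];
        }
        return deltas;
    }

    // Decode gaps back to absolute doc ids.
    static int[] decode(int[] deltas) {
        int[] docs = new int[deltas.length];
        int doc = 0;
        for (int i = 0; i < deltas.length; i++) {
            int docCode = deltas[i];
            doc += docCode; // accumulate; 'doc = docCode' would return the gaps, not the ids
            docs[i] = doc;
        }
        return docs;
    }
}
```

With plain assignment only the first doc id of a list would come out right, 
which is why a basic single-document test can pass while larger ones fail.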



[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index

2008-07-19 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1340:


Attachment: LUCENE-1340.patch

Thanks Mike, with just a little bit more hand-holding we are going to be there 
:)
 
I *think* I have *.prx IO excluded in case omitTf==true, please have a look, 
this part is really not an easy one (*Merger).

Also, now if a single field has mixed true/false for omitTf, I set it to true.

One unit test is already there, basic use case works, but the test has to cover 
a bit more




[jira] Created: (LUCENE-1340) Make it posible not to include TF information in index

2008-07-18 Thread Eks Dev (JIRA)

Make it posible not to include TF information in index
--

 Key: LUCENE-1340
 URL: https://issues.apache.org/jira/browse/LUCENE-1340
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Eks Dev
Priority: Minor


Term frequency is typically not needed for all fields; some CPU (reading one 
VInt less and one X1...) and IO can be spared by making pure boolean fields 
possible in Lucene. This topic has already been discussed and accepted as a 
part of Flexible Indexing... This issue tries to push things forward a bit 
faster, as I have some concrete customer demands.

Benefits can be expected for fields that are typical candidates for Filters: 
enumerations, user rights, IDs or very short texts, phone numbers, zip 
codes, names...

Status: just passed the standard tests (compatibility), committed for early review. I 
have not tried the new feature yet; some asserts and one or two unit tests are missing.

Complexity: simpler than expected

can be used via omitTf() (who used omitNorms() will know where to find it :)  
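
To make the "one VInt less" point concrete, here is a minimal, self-contained sketch. It assumes the classic Lucene postings layout, in which each entry stores the doc delta shifted left by one bit, with the low bit flagging freq == 1 and an extra VInt following for larger frequencies; with TF omitted, only the plain delta is written. The class and method names are hypothetical and this is not the code from the patch.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch, not Lucene code: compares the bytes needed for one
// term's postings with and without term frequencies.
final class OmitTfSketch {

    // Write a non-negative int as a variable-length VInt (7 payload bits per byte).
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // With TF: the delta is shifted left; a set low bit means freq == 1,
    // otherwise the frequency follows as its own VInt.
    static byte[] encodeWithTf(int[] docs, int[] freqs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        for (int i = 0; i < docs.length; i++) {
            int delta = docs[i] - last;
            last = docs[i];
            if (freqs[i] == 1) {
                writeVInt(out, (delta << 1) | 1);
            } else {
                writeVInt(out, delta << 1);
                writeVInt(out, freqs[i]);
            }
        }
        return out.toByteArray();
    }

    // Without TF: only the delta is written, no shift and no second VInt.
    static byte[] encodeWithoutTf(int[] docs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        for (int doc : docs) {
            writeVInt(out, doc - last);
            last = doc;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int[] docs = {3, 70, 200};
        int[] freqs = {1, 4, 1};
        System.out.println(encodeWithTf(docs, freqs).length + " bytes with TF");
        System.out.println(encodeWithoutTf(docs).length + " bytes without TF");
    }
}
```

The saving comes from two places: the frequency VInt disappears, and the unshifted delta crosses the one-byte VInt limit later.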


[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index

2008-07-18 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1340:


Attachment: LUCENE-1340.patch

first cut

 Make it posible not to include TF information in index
 --

 Key: LUCENE-1340
 URL: https://issues.apache.org/jira/browse/LUCENE-1340
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Eks Dev
Priority: Minor
 Attachments: LUCENE-1340.patch

   Original Estimate: 24h
  Remaining Estimate: 24h


[jira] Commented: (LUCENE-1187) Things to be done now that Filter is independent from BitSet

2008-03-14 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12578656#action_12578656
]

Eks Dev commented on LUCENE-1187:
-

Michael,
I do not think we need to add Factory (for this particular reason), DocIdSet
type should not be assumed as we could come up with smart ways to select
optimal Filter representation depending on doc-id distribution, size...

The only problem we have is that the contrib classes ChainedFilter and
BooleanFilter assume BitSet.
The solution for this would be to add just a few methods to DocIdSet
that are able to do AND/OR/NOT on DocIdSet[] using DocIdSetIterator,
e.g.
DocIdSet or(DocIdSet[] sets, int minimumShouldMatch);
DocIdSet or(DocIdSet[] sets);

Optimized code for these basic operations *already exists* and can be copied from
the Conjunction/Disjunction/ReqOpt/ReqExcl Scorer classes by simply
stripping off the scoring part.

With these utility methods in DocIdSet, rewriting ChainedFilter/BooleanFilter
to work with DocIdSet (so that it works on all implementations of
Filter/DocIdSet) is a 10-minute job... Then, if needed, this implementation can
be optimized to cover type-specific cases. Imo, BooleanFilter is the better bet; we
do not need both of them.

Unfortunately I do not have time to play with it in the next 3-4 weeks, but it should be
no more than 2 days of work (remember, we have the difficult part already done in
Scorers). Having so much code duplication is not really a good thing, but we
can then later merge these somehow.
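
As a rough illustration of the proposed or(...) helpers, here is a minimal sketch that works on sorted int arrays as a stand-in for DocIdSet/DocIdSetIterator. The class name, the array-based representation and the and(...) helper are assumptions made for the example; the real methods would walk iterators the way the Scorer classes do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: boolean operations over sorted, duplicate-free doc-id
// arrays, standing in for DocIdSet / DocIdSetIterator.
final class DocIdSetOps {

    // Cursor over one sorted doc-id array.
    private static final class Cursor {
        final int[] docs;
        int pos;

        Cursor(int[] docs) {
            this.docs = docs;
        }

        int doc() {
            return docs[pos];
        }
    }

    // Returns the doc ids contained in at least minimumShouldMatch of the sets.
    static int[] or(int[][] sets, int minimumShouldMatch) {
        PriorityQueue<Cursor> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a.doc(), b.doc()));
        for (int[] set : sets) {
            if (set.length > 0) {
                queue.add(new Cursor(set));
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!queue.isEmpty()) {
            int doc = queue.peek().doc();
            int matches = 0;
            // Advance every cursor that is positioned on the current doc.
            while (!queue.isEmpty() && queue.peek().doc() == doc) {
                Cursor c = queue.poll();
                matches++;
                if (++c.pos < c.docs.length) {
                    queue.add(c);
                }
            }
            if (matches >= minimumShouldMatch) {
                result.add(doc);
            }
        }
        return result.stream().mapToInt(Integer::intValue).toArray();
    }

    // Plain disjunction: a doc matches if any set contains it.
    static int[] or(int[][] sets) {
        return or(sets, 1);
    }

    // Conjunction expressed through the same merge: every set must match.
    static int[] and(int[][] sets) {
        return or(sets, sets.length);
    }
}
```

No scoring is involved, which is the point of the suggestion: the merge logic is the same as in the scorers, minus the score bookkeeping.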

Things to be done now that Filter is independent from BitSet

Key: LUCENE-1187
URL: https://issues.apache.org/jira/browse/LUCENE-1187
Project: Lucene - Java
Issue Type: Improvement
Reporter: Paul Elschot
Priority: Minor
Attachments: ChainedFilterAndCachingFilterTest.patch,
javadocsZero2Match.patch

(Aside: where is the documentation on how to mark up text in jira comments?)
The following things are left over after LUCENE-584 :
For Lucene 3.0 Filter.bits() will have to be removed.
There is a CHECKME in IndexSearcher about using ConjunctionScorer to have the
boolean behaviour of a Filter.
I have not looked into Filter caching yet, but I suppose there will be some
room for improvement there.
Iirc the current core has moved to use OpenBitSetFilter and that is probably
what is being cached.
In some cases it might be better to cache a SortedVIntList instead.
Boolean logic on DocIdSetIterator is already available for Scorers (that
inherit from DocIdSetIterator) in the search package. This is currently
implemented by ConjunctionScorer, DisjunctionSumScorer,
ReqOptSumScorer and ReqExclScorer.
Boolean logic on BitSets is available in contrib/misc and contrib/queries
DisjunctionSumScorer calls score() on its subscorers before the score value
is actually needed.
This could be a reason to introduce a DisjunctionDocIdSetIterator, perhaps as
a superclass of DisjunctionSumScorer.
To fully implement non scoring queries a TermDocIdSetIterator will be needed,
perhaps as a superclass of TermScorer.
The javadocs in org.apache.lucene.search using matching vs non-zero score:
I'll investigate this soon, and provide a patch when necessary.
An early version of the patches of LUCENE-584 contained a class Matcher,
that differs from the current DocIdSet in that Matcher has an explain()
method.
It remains to be seen whether such a Matcher could be useful between
DocIdSet and Scorer.
The semantics of scorer.skipTo(scorer.doc()) was discussed briefly.
This was also discussed at another issue recently, so perhaps it is worthwhile
to open a separate issue for this.
Skipping on a SortedVIntList is done using linear search, this could be
improved by adding multilevel skiplist info much like in the Lucene index for
documents containing a term.
One comment by me of 3 Dec 2008:
A few complete (test) classes are deprecated, it might be good to add the
target release for removal there.


[jira] Updated: (LUCENE-1219) support array/offset/ length setters for Field with binary data

2008-03-12 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1219:


Attachment: LUCENE-1219.patch

latest patch updated to the trunk (Lucene-1217 is there. Michael you did not 
mark it as resolved.) 



 support array/offset/ length setters for Field with binary data
 ---

 Key: LUCENE-1219
 URL: https://issues.apache.org/jira/browse/LUCENE-1219
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Eks Dev
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, 
 LUCENE-1219.patch


 Currently the Field/Fieldable interface supports only compact, zero-based byte 
 arrays. This forces end users to create new objects and copy content into them 
 before passing them to Lucene, as such fields are often of variable size. 
 Depending on the use case, avoiding this copy can bring a far from negligible 
 performance improvement. 
 This approach extends the Fieldable interface with 3 new methods: 
 getOffset(), getLength() and getBinaryValue() (this only returns a reference 
 to the array)



[jira] Updated: (LUCENE-1217) use isBinary cached variable instead of instanceof in Filed

2008-03-11 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1217:


Attachment: LUCENE-1217.patch

 use isBinary cached variable instead of instanceof in Filed
 ---

 Key: LUCENE-1217
 URL: https://issues.apache.org/jira/browse/LUCENE-1217
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Other
Reporter: Eks Dev
Priority: Trivial
 Attachments: LUCENE-1217.patch


 The Field class can hold three types of values; 
 see AbstractField.java: protected Object fieldsData = null; 
 Currently, RTTI (instanceof) is mainly used to determine the type of the 
 value stored in a particular instance of Field, but for the binary value we 
 have a mix of RTTI and the cached boolean variable isBinary. 
 This patch makes consistent use of the cached variable isBinary.
 Benefit: one consistent way to determine the run-time type in the binary 
 case (reduces the chance of the cached variable getting out of sync). It should be 
 slightly faster as well.
 Thinking aloud: 
 Would it not make sense to maintain the type with some integer/byte "poor man's 
 enum" (an interface with a couple of constants)
 {code:java}
 public static interface Type {
   public static final byte BINARY = 0;
   public static final byte STRING = 1;
   public static final byte READER = 2;
 }
 {code}
 and use that instead of isBinary + instanceof? 


[jira] Created: (LUCENE-1217) use isBinary cached variable instead of instanceof in Filed

2008-03-11 Thread Eks Dev (JIRA)

use isBinary cached variable instead of instanceof in Filed
---

 Key: LUCENE-1217
 URL: https://issues.apache.org/jira/browse/LUCENE-1217
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Other
Reporter: Eks Dev
Priority: Trivial


The Field class can hold three types of values; 
see AbstractField.java: protected Object fieldsData = null; 

Currently, RTTI (instanceof) is mainly used to determine the type of the value 
stored in a particular instance of Field, but for the binary value we have a mix of 
RTTI and the cached boolean variable isBinary. 

This patch makes consistent use of the cached variable isBinary.

Benefit: one consistent way to determine the run-time type in the binary case 
(reduces the chance of the cached variable getting out of sync). It should be slightly 
faster as well.

Thinking aloud: 
Would it not make sense to maintain the type with some integer/byte "poor man's 
enum" (an interface with a couple of constants)
{code:java}
public static interface Type {
  public static final byte BINARY = 0;
  public static final byte STRING = 1;
  public static final byte READER = 2;
}
{code}

and use that instead of isBinary + instanceof? 
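
A minimal sketch of how such a type tag could replace the isBinary + instanceof mix. Everything here is hypothetical (class name, factory methods, and naming the binary constant BINARY); the point is only that the tag is set once together with the value, so the two cannot drift apart.

```java
import java.io.Reader;

// Hypothetical sketch, not the Lucene Field class: the type tag is assigned in
// the same constructor call as the value, so it cannot get out of sync.
final class TypedValue {
    static final byte BINARY = 0;
    static final byte STRING = 1;
    static final byte READER = 2;

    private final Object data;
    private final byte type;

    private TypedValue(Object data, byte type) {
        this.data = data;
        this.type = type;
    }

    static TypedValue of(byte[] value) {
        return new TypedValue(value, BINARY);
    }

    static TypedValue of(String value) {
        return new TypedValue(value, STRING);
    }

    static TypedValue of(Reader value) {
        return new TypedValue(value, READER);
    }

    byte type() {
        return type;
    }

    boolean isBinary() {
        return type == BINARY;
    }

    // Each accessor checks the tag instead of using instanceof.
    byte[] binaryValue() {
        return type == BINARY ? (byte[]) data : null;
    }

    String stringValue() {
        return type == STRING ? (String) data : null;
    }

    Reader readerValue() {
        return type == READER ? (Reader) data : null;
    }
}
```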


[jira] Created: (LUCENE-1219) support array/offset/ length setters for Field with binary data

2008-03-11 Thread Eks Dev (JIRA)

support array/offset/ length setters for Field with binary data
---

 Key: LUCENE-1219
 URL: https://issues.apache.org/jira/browse/LUCENE-1219
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Eks Dev
Priority: Minor


Currently the Field/Fieldable interface supports only compact, zero-based byte 
arrays. This forces end users to create new objects and copy content into them before 
passing them to Lucene, as such fields are often of variable size. Depending on 
the use case, avoiding this copy can bring a far from negligible performance improvement. 

This approach extends the Fieldable interface with 3 new methods: 
getOffset(), getLength() and getBinaryValue() (this only returns a reference to 
the array)

   


[jira] Updated: (LUCENE-1219) support array/offset/ length setters for Field with binary data

2008-03-11 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1219:


Attachment: LUCENE-1219.patch

All tests pass with this patch. 
Some polish and probably more testing are needed. TODOs:

- someone pedantic should check whether these new set/get methods could be named 
better 
- check if there are more places where this new feature could/should be used; I 
think I have changed all of them but one, the direct subclass FieldForMerge 
in FieldsReader. This is code I do not know, so I did not touch it...
- the javadoc is poor 

This should be enough to get us started.

The only pseudo-issue I see is that 
public byte[] binaryValue(); now creates a byte[] and copies the content into it; 
a reference to the original array can now be fetched via the getBinaryValue() method... 
This is to preserve compatibility, as users expect a compact, zero-based array 
from this method and we now keep offset/length in Field. 
It is a pseudo-issue because users should already have a reference to this array, 
so this method is rather superfluous for end users.

 




 support array/offset/ length setters for Field with binary data
 ---

 Key: LUCENE-1219
 URL: https://issues.apache.org/jira/browse/LUCENE-1219
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Eks Dev
Priority: Minor
 Attachments: LUCENE-1219.patch



[jira] Updated: (LUCENE-1219) support array/offset/ length setters for Field with binary data

2008-03-11 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1219:


Attachment: LUCENE-1219.patch

 support array/offset/ length setters for Field with binary data
 ---

 Key: LUCENE-1219
 URL: https://issues.apache.org/jira/browse/LUCENE-1219
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Eks Dev
Priority: Minor
 Attachments: LUCENE-1219.patch



[jira] Updated: (LUCENE-1219) support array/offset/ length setters for Field with binary data

2008-03-11 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1219:


Attachment: LUCENE-1219.patch

Michael McCandless had some nice ideas on how to make the performance penalty 
of the getValue() change negligible for legacy usage; this patch includes them: 
- deprecates the getValue() method 
- returns a direct reference if offset == 0 && length == data.length
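
The fast path described in the list above can be sketched as follows. This is a minimal illustration, not the patch: BinarySlice is a hypothetical class, and the accessor names follow the ones used in this thread (binaryValue() for the legacy compact copy, getBinaryValue()/getOffset()/getLength() for the shared array).

```java
import java.util.Arrays;

// Hypothetical sketch: a binary value stored as array + offset + length.
final class BinarySlice {
    private final byte[] data;
    private final int offset;
    private final int length;

    BinarySlice(byte[] data, int offset, int length) {
        if (offset < 0 || length < 0 || offset > data.length - length) {
            throw new IllegalArgumentException("slice out of bounds");
        }
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    // New-style accessors: shared reference, no copy.
    byte[] getBinaryValue() {
        return data;
    }

    int getOffset() {
        return offset;
    }

    int getLength() {
        return length;
    }

    // Legacy accessor: callers expect a compact, zero-based array.
    // If the slice already covers the whole array, no copy is needed.
    byte[] binaryValue() {
        if (offset == 0 && length == data.length) {
            return data;
        }
        return Arrays.copyOfRange(data, offset, offset + length);
    }
}
```

With this shape, legacy callers that always passed compact arrays pay nothing extra, while callers that reuse a larger buffer avoid the copy entirely by using the offset/length accessors.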

 support array/offset/ length setters for Field with binary data
 ---

 Key: LUCENE-1219
 URL: https://issues.apache.org/jira/browse/LUCENE-1219
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Eks Dev
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1219.patch, LUCENE-1219.patch



[jira] Commented: (LUCENE-1217) use isBinary cached variable instead of instanceof in Filed

2008-03-11 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12577591#action_12577591
 ] 

Eks Dev commented on LUCENE-1217:
-

Thanks for looking into it!
Subclassing now, with backwards compatibility, would be clumsy; I was thinking 
about it but could not find a clean way to do it.

"Or we could wait until Java 5 (3.0) and use real enums?"
Yes, that is the ultimate solution, but my line of thought was that a poor man's 
enum -> Java 5 enum migration would be trivial later... but "do not change 
working code" kicks in here :)  

 use isBinary cached variable instead of instanceof in Filed
 ---

 Key: LUCENE-1217
 URL: https://issues.apache.org/jira/browse/LUCENE-1217
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Other
Reporter: Eks Dev
Assignee: Michael McCandless
Priority: Trivial
 Attachments: LUCENE-1217.patch



[jira] Commented: (LUCENE-1219) support array/offset/ length setters for Field with binary data

2008-03-11 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12577597#action_12577597
]

Eks Dev commented on LUCENE-1219:
-

I do not know for sure if this is something we could not live with. Adding new
interface sounds equally bad, would work nicely, but I do not like it as it
makes code harder to follow with too many interfaces ... I'll have another
look at it to see if there is a way to do it without interface changes. Any
ideas?

support array/offset/ length setters for Field with binary data
---


[jira] Commented: (LUCENE-1217) use isBinary cached variable instead of instanceof in Filed

2008-03-11 Thread Eks Dev (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12577601#action_12577601
 ] 

Eks Dev commented on LUCENE-1217:
-

hah, this bug just  justified this patch :) 
sorry,  I should have run tests before... nothing is trivial enough.   
 The problem was indeed isBinary that went out of sync in LazyField, new patch 
follows 

 use isBinary cached variable instead of instanceof in Filed
 ---

 Key: LUCENE-1217
 URL: https://issues.apache.org/jira/browse/LUCENE-1217
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Other
Reporter: Eks Dev
Assignee: Michael McCandless
Priority: Trivial
 Attachments: LUCENE-1217.patch



[jira] Updated: (LUCENE-1217) use isBinary cached variable instead of instanceof in Filed

2008-03-11 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1217:


Attachment: Lucene-1217-take1.patch

new patch, fixes isBinary status in LazyField

 use isBinary cached variable instead of instanceof in Filed
 ---

 Key: LUCENE-1217
 URL: https://issues.apache.org/jira/browse/LUCENE-1217
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Other
Reporter: Eks Dev
Assignee: Michael McCandless
Priority: Trivial
 Attachments: Lucene-1217-take1.patch, LUCENE-1217.patch



[jira] Updated: (LUCENE-1219) support array/offset/ length setters for Field with binary data

2008-03-11 Thread Eks Dev (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1219:


Attachment: LUCENE-1219.patch

This one keeps the addition of new methods localized to AbstractField and does not 
change the Fieldable interface... it looks like it could work this way with a 
few instanceof checks in FieldsWriter. This one depends on LUCENE-1217. 

It will not give you any benefit if you implement Fieldable directly 
without extending AbstractField; therefore I would suggest eventually 
changing Fieldable to support all these methods that operate with offset/length. 
Or someone clever finds a way to change an interface without breaking 
backwards compatibility :)
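
The trade-off can be sketched like this, with hypothetical stand-ins for Fieldable (BinarySource), AbstractField (SlicedBinarySource) and FieldsWriter (BinaryWriterSketch). It is only a sketch of the instanceof approach under those assumptions, not the code in the patch.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Stand-in for the unchanged public interface.
interface BinarySource {
    byte[] binaryValue(); // compact, zero-based array
}

// Stand-in for the abstract base class that gains the new accessors.
abstract class SlicedBinarySource implements BinarySource {
    abstract byte[] getBinaryValue();

    abstract int getOffset();

    abstract int getLength();

    // The legacy method still works, at the price of a copy.
    @Override
    public byte[] binaryValue() {
        return Arrays.copyOfRange(getBinaryValue(), getOffset(), getOffset() + getLength());
    }
}

// Stand-in for the writer: uses the copy-free path only when it is available.
final class BinaryWriterSketch {
    static void write(BinarySource source, ByteArrayOutputStream out) {
        if (source instanceof SlicedBinarySource) {
            SlicedBinarySource sliced = (SlicedBinarySource) source;
            out.write(sliced.getBinaryValue(), sliced.getOffset(), sliced.getLength());
        } else {
            // Direct implementations of the interface get no benefit.
            byte[] copy = source.binaryValue();
            out.write(copy, 0, copy.length);
        }
    }
}
```

The interface stays untouched, so existing implementations keep compiling, but only subclasses of the abstract class reach the fast branch.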

 support array/offset/ length setters for Field with binary data
 ---

 Key: LUCENE-1219
 URL: https://issues.apache.org/jira/browse/LUCENE-1219
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Eks Dev
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch

