[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-11-12 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Description: 
h2. Context
\*suffix and \*infix\* searches on large indexes. 

h2. Problem
Obviously applying {{ReversedWildcardFilter}} doubles an index size, and I'm 
shuddering to think about EdgeNGrams...

h2. Proposal 
_DR_-Y- postings


  was:
h2. Context
\*suffix and \*infix\* searches on large indexes. 

h2. Problem
Obviously applying {{ReversedWildcardFilter}} doubles an index size, and I'm 
shuddering to think about EdgeNGrams...

h2. Proposal 
_DRY_



> Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc  
> 
>
> Key: LUCENE-7863
> URL: https://issues.apache.org/jira/browse/LUCENE-7863
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Mikhail Khludnev
> Attachments: LUCENE-7863.hazard, LUCENE-7863.patch, 
> LUCENE-7863.patch, LUCENE-7863.patch, LUCENE-7863.patch, LUCENE-7863.patch, 
> LUCENE-7863.patch, LUCENE-7863.patch, LUCENE-7863.patch, LUCENE-7863.patch, 
> LUCENE-7863.patch, bench-byte-array-long.out, bench-byte-array2.out, 
> benchmark-1m.out, byterefshash-bench.txt
>
>
> h2. Context
> \*suffix and \*infix\* searches on large indexes. 
> h2. Problem
> Obviously applying {{ReversedWildcardFilter}} doubles an index size, and I'm 
> shuddering to think about EdgeNGrams...
> h2. Proposal 
> _DR_-Y- postings



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-11-05 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Attachment: byterefshash-bench.txt
LUCENE-7863.patch

[^LUCENE-7863.patch] adds a more efficient ngram mapping based on BytesRefHash. 
[^byterefshash-bench.txt] contains benchmark results on an i5 with an SSD. 
h2. Summary
{code}
> Report sum by Prefix (index) and Round (2 about 2 out of 18)
Operation  round  work dir  src                           flush   cdc                 runCnt  recsPerRun  rec/s   elapsedSec  avgUsedMem     avgTotalMem    directory size, Mb
index      0      edge      EnwikiEdgeContentSource       000.00  Lucene70Codec       1       501         619.77  8,067.48    1,397,385,984  1,512,570,880  23,783.00
index      1      deriv     EnwikiEmptyEdgeContentSource  50.00   DeriveBodyRevCodec  1       501         571.40  8,750.42    2,095,467,008  6,383,730,688  7,687.00

> Report sum by Prefix (search) and Round (2 about 2 out of 18)
Operation  round  work dir  src                           flush   cdc                 runCnt  recsPerRun  rec/s   elapsedSec  avgUsedMem     avgTotalMem    directory size, Mb
search_50  0      edge      EnwikiEdgeContentSource       000.00  Lucene70Codec       1       50          17.63   2.84        1,291,492,352  1,510,998,016  23,783.00
search_50  1      deriv     EnwikiEmptyEdgeContentSource  50.00   DeriveBodyRevCodec  1       50          5.97    8.38        2,205,875,712  6,383,730,688  7,687.00
{code}

* Indexing derivative terms is about 8% slower than indexing edge ngrams.
* Heap consumption during indexing is about 4 times greater (6.4 GB vs 1.5 GB).
* The index is more than 3 times smaller. I expect a bigger gain on regular indices.
* Search throughput is about 3 times lower with derivative terms, but this is only a 
few cold searches. There is a reason why wildcard searches on derivative terms are 
slower: they require random reads. However, at some point the absence of repeated 
postings should pay off and make search faster, e.g. when the index isn't fully mapped.





[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-09-23 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Attachment: bench-byte-array-long.out




[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-09-23 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Attachment: bench-byte-array-long.out

[^bench-byte-array-long.out] is the long test log, which evaluates a larger RAM 
buffer for derivative terms. Here is the summary: 
* Derivative terms are indexed 25% slower than EdgeNGrams (see below). 
* They significantly reduce index size. For a usual case, the gain would be 
bigger, since here we have multi-language docs that make postings shorter. 
* Derivative terms roughly double RAM consumption for indexing (see below). 
* Searching for derivative terms is 30-60% slower, since it requires 
gathering randomly distributed postings.

Indexing can be optimized by using BytesRefHash to collect the multivalued 
mapping:
{code}
EdgeNGram -> {postingOffset}
{code}
It also allows appending only the minimum number of bytes needed to make each 
EdgeNGram entry unique. Currently, 5 bytes are wastefully appended to every 
EdgeNGram.
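
A minimal sketch of the multivalued mapping described above, in plain Java. It is not taken from the patch: a `HashMap` stands in for Lucene's BytesRefHash, and the names `EdgeNGramOffsets`, `add` and `offsetsFor` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: maps each edge ngram (prefix) of a term to the
// offsets of the postings of the original terms it was derived from.
final class EdgeNGramOffsets {
    private final Map<String, List<Long>> offsetsByGram = new HashMap<>();

    // Registers every prefix of `term` with at least `minGram` characters
    // as pointing at the postings of `term`, which start at `postingOffset`.
    void add(String term, long postingOffset, int minGram) {
        for (int len = minGram; len <= term.length(); len++) {
            String gram = term.substring(0, len);
            offsetsByGram.computeIfAbsent(gram, g -> new ArrayList<>()).add(postingOffset);
        }
    }

    // Returns the posting offsets collected for `gram`, or an empty list.
    List<Long> offsetsFor(String gram) {
        return offsetsByGram.getOrDefault(gram, Collections.emptyList());
    }
}
```

In this toy model a shared prefix such as "luc" ends up with one offset per original term, so the postings themselves are stored only once.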




[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-09-23 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Attachment: (was: bench-byte-array-long.out)




[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-09-22 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Attachment: bench-byte-array2.out

Here is a solid benchmark log, [^bench-byte-array2.out], from running both rounds 
one after the other: edgeNgram, then derivativeTerms.
Composing the same result table again:
|round|indexing, sec|search req/sec|ram total, GB |index size, GB| 
| EdgeNGram |5,890.05|61.55|2.7|23|
|derived edges|6,981.87|26.51|11.5|8.4|

The results are somewhat different. Indexing derived terms is slower, probably 
because of the really small RAM buffer I set earlier. Search is about 2.3 times 
slower (26.51 vs 61.55 req/sec). 




[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-09-21 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Attachment: LUCENE-7863.patch

[^LUCENE-7863.patch] replaces TreeMap with BytesRefArray; see 
{{ByteArrayDerivativeWriter.java}}. Here are the results for 5M docs:
|round|indexing, mins|search req/sec|ram total, GB |index size, GB| 
| EdgeNGram |85|27.82|2.3|23|
|derived edges|51|7.22|5.5|9.1|
We gain in index size and even in indexing time, at the cost of some RAM, as 
expected. 
The EdgeNGram cache can be made a little more compact. The trick is to append 
something to each edge ngram to make it unique. 
The interesting thing is that search is more than 3 times slower. I suppose that 
posting offsets obtained during term expansion could be sorted before reading postings. 
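
A minimal sketch of why sorting the offsets could help, assuming a toy cost model in which cost is simply the distance the file pointer travels between reads. The class `SortedOffsetReads` and its methods are hypothetical and not part of the patch.

```java
import java.util.Arrays;

// Hypothetical sketch: compares the total seek distance when postings are
// read in term-expansion order versus in ascending file-offset order.
final class SortedOffsetReads {
    // Total distance the file pointer travels when reading at the given
    // offsets in the given order, starting from position 0.
    static long seekDistance(long[] offsets) {
        long distance = 0;
        long position = 0;
        for (long offset : offsets) {
            distance += Math.abs(offset - position);
            position = offset;
        }
        return distance;
    }

    // Returns a sorted copy, so reads proceed front to back through the file.
    static long[] sortedForReading(long[] offsets) {
        long[] copy = offsets.clone();
        Arrays.sort(copy);
        return copy;
    }
}
```

For the offsets {900, 100, 700, 300} the unsorted order travels 2700 units, while the sorted order travels 900. Real gains would depend on the OS page cache and the storage device.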
 




[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-09-19 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Attachment: benchmark-1m.out

I've run the benchmark for 1M wiki docs [^benchmark-1m.out]. It turns out that 
memory consumption for derivative terms (or at least for the current 
implementation) is terrible, so I couldn't run even a 4M benchmark on a 16G 
laptop. Therefore, using BytesRefHash is absolutely necessary (the current code is 
pretty dumb). Also, I've realized that terms are derived on every merge, but I 
have no idea how to avoid that. 

Here is the comparison on 1M wiki docs with URL terms excluded.  

|round|indexing, mins|search req/sec|ram total, GB |index size, GB| 
| EdgeNGram |25|81.04|1.9|6.3|
|derived edges|18|35.31|10.2|2.0|

So far, search results don't match side by side, but I'm not sure whether they 
are expected to match in the benchmark. A good random test is necessary (FWIW, 
the existing test actually tests nothing).  




[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-09-18 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Attachment: LUCENE-7863.patch

[^LUCENE-7863.patch] properly instantiates the offset buffer. 




[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-09-17 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Attachment: LUCENE-7863.patch

[^LUCENE-7863.patch] is much friendlier for review.  




[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-09-16 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Attachment: LUCENE-7863.patch

OK, [^LUCENE-7863.patch] should pass the benchmark. 
The thing is, there has to be a single postings format, since different formats 
write into files with different suffixes.
The idea of writing negative offsets (ZLong) conditionally turns out to be a dead 
end. Please hold off on reviews until I change the format to write ZLong for all 
fields' metadata.  
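
A minimal sketch of why negative offsets call for a zig-zag (ZLong) encoding rather than a plain variable-length one. This is standalone Java, not the patch's code; the class `ZigZagSketch` and the helper `vLongBytes` are hypothetical. As far as I know, Lucene's `DataOutput` offers `writeVLong` and `writeZLong` for the two encodings.

```java
// Hypothetical sketch of zig-zag encoding for signed offset deltas.
final class ZigZagSketch {
    // Zig-zag encoding: small magnitudes of either sign map to small
    // non-negative values (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...).
    static long zigZagEncode(long v) {
        return (v >> 63) ^ (v << 1);
    }

    // Inverse of zigZagEncode.
    static long zigZagDecode(long v) {
        return (v >>> 1) ^ -(v & 1);
    }

    // Number of bytes a variable-length encoding with 7 payload bits per
    // byte needs for the raw 64-bit pattern of v.
    static int vLongBytes(long v) {
        int bytes = 1;
        while ((v & ~0x7FL) != 0) {
            v >>>= 7;
            bytes++;
        }
        return bytes;
    }
}
```

In this sketch the raw bit pattern of -1 would take 10 bytes in the variable-length form, while its zig-zag encoded value takes 1 byte.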




[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-09-15 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Attachment: LUCENE-7863.patch

[^LUCENE-7863.patch] has significant fixes for codec registration.
- It looks like a large enough term dictionary hits a code path in 
{{IntersectingTermsEnum}} that is broken by the index format changes 
introduced here.
- It's reproduced with {{derivative-terms-only.alg}}:
{code}
java.io.EOFException: seek past EOF: MMapIndexInput(path="...lucene-solr/lucene/benchmark/deriv/index/_0_Lucene50HijackInjector_0.doc")
    at org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.seek(ByteBufferIndexInput.java:366)
    at org.apache.lucene.codecs.lucene50.Lucene50PostingsReader$BlockDocsEnum.reset(Lucene50PostingsReader.java:306)
    at org.apache.lucene.codecs.lucene50.Lucene50PostingsReader.postings(Lucene50PostingsReader.java:210)
    at org.apache.lucene.codecs.blocktree.SegmentTermsEnum.postings(SegmentTermsEnum.java:1006)
    at org.apache.lucene.search.MultiTermQueryConstantScoreWrapper$1.rewrite(MultiTermQueryConstantScoreWrapper.java:166)
{code}
- Overall, the idea of simply changing VLong to ZLong through overriding turns 
out not to be a good one: it requires many changes that remove encapsulation and 
{{final}} modifiers, which defeats their purpose.   




[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-09-15 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Attachment: (was: LUCENE-7863.patch)




[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-09-15 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Attachment: (was: LUCENE-7863.patch)




[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-09-15 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Attachment: LUCENE-7863.patch




[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-09-15 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Attachment: LUCENE-7863.patch

[^LUCENE-7863.patch] benchmark fixes. 






[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-09-14 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Attachment: LUCENE-7863.patch

One serious fix in [^LUCENE-7863.patch]; the benchmark is in progress.




[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-08-27 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Attachment: LUCENE-7863.patch

[^LUCENE-7863.patch]
Moved the test to {{benchmark}} to resolve a dependency. Started work on the benchmark; WIP. 
  




[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-08-21 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Attachment: LUCENE-7863.patch

infix proof case 




[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-08-20 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Attachment: LUCENE-7863.patch

WIP [^LUCENE-7863.patch]
It introduces a codec with two postings formats:
# a hijacking PF, which stores posting offsets for the original terms
# an injecting PF, which reverses terms and supplies offsets to the original terms' 
postings (this is the only place where the file format changes: offsets are written 
with ZLong, since these offset deltas are negative)
It has to break into many private and final members, which blows up the patch.
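
A toy in-memory model of the idea, to show where negative offset deltas come from. It is only a sketch under simplifying assumptions: the class `DerivedTermsSketch` and its methods are hypothetical, a `TreeMap` stands in for the term dictionary, and offsets are delta-coded against the previous term in dictionary order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: reversed terms reuse the postings of the original
// terms instead of writing a second copy.
final class DerivedTermsSketch {
    // Reverses each original term and keeps the offset of the original
    // postings. The TreeMap keeps reversed terms in dictionary order.
    static TreeMap<String, Long> reversedTermOffsets(Map<String, Long> originalOffsets) {
        TreeMap<String, Long> reversed = new TreeMap<>();
        for (Map.Entry<String, Long> e : originalOffsets.entrySet()) {
            String reversedTerm = new StringBuilder(e.getKey()).reverse().toString();
            reversed.put(reversedTerm, e.getValue());
        }
        return reversed;
    }

    // Delta-codes the offsets in dictionary order, starting from 0.
    static List<Long> offsetDeltas(TreeMap<String, Long> termOffsets) {
        List<Long> deltas = new ArrayList<>();
        long previous = 0;
        for (long offset : termOffsets.values()) {
            deltas.add(offset - previous);
            previous = offset;
        }
        return deltas;
    }
}
```

For original terms written in order, the deltas are non-negative. For reversed terms the dictionary order no longer follows the file order, so a delta can be negative, which is why a signed encoding is needed in this model.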




[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc

2017-06-19 Thread Mikhail Khludnev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated LUCENE-7863:
-
Summary: Don't repeat postings (and perhaps positions) on ReverseWF, 
EdgeNGram, etc  (was: Don't repeat postings and positions on ReverseWF, 
EdgeNGram, etc)
