[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Description:
h2. Context
\*suffix and \*infix\* searches on large indexes.
h2. Problem
Obviously applying {{ReversedWildcardFilter}} doubles an index size, and I'm shuddering to think about EdgeNGrams...
h2. Proposal
_DR_-Y- postings

  was:
h2. Context
\*suffix and \*infix\* searches on large indexes.
h2. Problem
Obviously applying {{ReversedWildcardFilter}} doubles an index size, and I'm shuddering to think about EdgeNGrams...
h2. Proposal
_DRY_

> Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
>
>                 Key: LUCENE-7863
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7863
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Mikhail Khludnev
>         Attachments: LUCENE-7863.hazard, LUCENE-7863.patch,
> LUCENE-7863.patch, LUCENE-7863.patch, LUCENE-7863.patch, LUCENE-7863.patch,
> LUCENE-7863.patch, LUCENE-7863.patch, LUCENE-7863.patch, LUCENE-7863.patch,
> LUCENE-7863.patch, bench-byte-array-long.out, bench-byte-array2.out,
> benchmark-1m.out, byterefshash-bench.txt
>
> h2. Context
> \*suffix and \*infix\* searches on large indexes.
> h2. Problem
> Obviously applying {{ReversedWildcardFilter}} doubles an index size, and I'm
> shuddering to think about EdgeNGrams...
> h2. Proposal
> _DR_-Y- postings

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Attachment: byterefshash-bench.txt
                LUCENE-7863.patch

[^LUCENE-7863.patch] uses a more efficient n-gram mapping based on BytesRefHash. [^byterefshash-bench.txt] contains benchmark results on an i5 with an SSD.

h2. Summary
{code}
> Report sum by Prefix (index) and Round (2 about 2 out of 18)
Operation round work dir src flush cdc runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem directory size, Mb
index 0 edge EnwikiEdgeContentSource 000.00 Lucene70Codec 1 501619.77 8,067.48 1,397,385,984 1,512,570,880 23,783.00
index 1 deriv EnwikiEmptyEdgeContentSource 50.00 DeriveBodyRevCodec 1 501571.40 8,750.42 2,095,467,008 6,383,730,688 7,687.00
> Report sum by Prefix (search) and Round (2 about 2 out of 18)
Operation round work dir src flush cdc runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem directory size, Mb
search_50 0 edge EnwikiEdgeContentSource 000.00 Lucene70Codec 1 50 17.63 2.84 1,291,492,352 1,510,998,016 23,783.00
search_50 1 deriv EnwikiEmptyEdgeContentSource 50.00 DeriveBodyRevCodec 1 50 5.97 8.38 2,205,875,712 6,383,730,688 7,687.00
{code}
* Indexing derivative terms is 8% slower than indexing edge n-grams.
* Heap consumption for indexing is about 4 times greater (1.5 GB vs 6.4 GB).
* The index is more than 3 times smaller. I expect a bigger gain on regular indices.
* Search throughput is 3 times lower with derivative terms, but this is only a few cold searches.

There is a reason why wildcard searches on derivative terms are slower: they cause random reads. However, at some point the absence of repeated postings should pay off and make searches faster, e.g. when the index isn't fully mapped.
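The ratios quoted in the summary can be rechecked from the two reports. The following is only a small arithmetic sketch: the class and method names are made up for illustration, and the search rec/s figures are read as 17.63 and 5.97 from the fused columns of the report, which is an interpretation of the log.

```java
final class BenchRatios {
    /** Ratio of b to a; for example 1.08 means b is 8% larger than a. */
    static double ratio(double a, double b) {
        return b / a;
    }

    public static void main(String[] args) {
        // elapsedSec for indexing: edge n-grams first, derivative terms second
        System.out.println("indexing time ratio:     " + ratio(8067.48, 8750.42));
        // avgTotalMem in bytes: edge n-grams first, derivative terms second
        System.out.println("heap ratio:              " + ratio(1_512_570_880d, 6_383_730_688d));
        // directory size in MB: derivative terms first, edge n-grams second
        System.out.println("index size ratio:        " + ratio(7687.00, 23783.00));
        // search rec/s (assumed reading of the log): derivative terms first, edge n-grams second
        System.out.println("search throughput ratio: " + ratio(5.97, 17.63));
    }
}
```

The results come out close to the figures in the bullets: roughly 8% more indexing time, about 4 times the heap, an index a little over 3 times smaller, and search throughput about 3 times lower.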
[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Attachment: bench-byte-array-long.out
[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Attachment: bench-byte-array-long.out

[^bench-byte-array-long.out] is the long test log, which evaluates a larger RAM buffer for derivative terms. Here is the summary.
* Derivative terms are indexed 25% slower than EdgeNGrams (see below).
* They significantly reduce index size. In a typical case the gain would be bigger, since here we have multi-language docs, which make postings shorter.
* Derivative terms roughly double RAM consumption for indexing (see below).
* Searching derivative terms is 30..60% slower, since it has to gather randomly distributed postings.

Indexing can be optimized by using BytesRefHash to collect the multi-valued mapping: {code} EdgeNGramm -> {postingOffset} {code}. It would also allow appending the least number of bytes to each EdgeNGram to make the entries unique. Currently, 5 bytes are wastefully appended to every EdgeNGram.
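The multi-valued mapping mentioned above (each edge n-gram pointing to the posting offsets of the original terms it was derived from) can be sketched with plain Java collections. This is an illustration only, not the patch's code: `EdgeNGramOffsets`, `addTerm`, `offsetsFor` and the sample offsets are hypothetical, and a `HashMap` stands in for Lucene's `BytesRefHash` to keep the sketch self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class EdgeNGramOffsets {
    // edge n-gram -> offsets of the postings of all original terms sharing that prefix
    private final Map<String, List<Long>> offsetsByGram = new HashMap<>();

    /** Registers every prefix of term (length >= minGram) against the term's posting offset. */
    void addTerm(String term, long postingOffset, int minGram) {
        for (int len = minGram; len <= term.length(); len++) {
            offsetsByGram
                .computeIfAbsent(term.substring(0, len), g -> new ArrayList<>())
                .add(postingOffset);
        }
    }

    /** Posting offsets recorded for the given n-gram, in insertion order; empty if unknown. */
    List<Long> offsetsFor(String gram) {
        return offsetsByGram.getOrDefault(gram, Collections.emptyList());
    }

    public static void main(String[] args) {
        EdgeNGramOffsets mapping = new EdgeNGramOffsets();
        mapping.addTerm("lucene", 100L, 2);
        mapping.addTerm("lucid", 250L, 2);
        System.out.println(mapping.offsetsFor("luc"));  // shared prefix: [100, 250]
        System.out.println(mapping.offsetsFor("luce")); // only "lucene": [100]
    }
}
```

Because one key holds all of its offsets, the n-gram itself is the unique entry and nothing needs to be appended to it. That is the saving hinted at in the comment.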
[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Attachment: (was: bench-byte-array-long.out)
[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Attachment: bench-byte-array2.out

Here is a solid benchmark log, [^bench-byte-array2.out], from running both rounds one after the other: edgeNgram, then derivativeTerms. Composing the same result table again:
|round|indexing, sec|search req/sec|ram total, GB |index size, GB|
| EdgeNGram |5,890.05|61.55|2.7|23|
|derived edges|6,981.87|26.51|11.5|8.4|

It's somewhat different. Indexing derived terms is slower, probably because of the really small RAM buffer I set earlier. Search throughput is about 2.3 times lower (61.55 vs 26.51 req/sec).
[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Attachment: LUCENE-7863.patch

[^LUCENE-7863.patch] replaces TreeMap with BytesRefArray; see {{ByteArrayDerivativeWriter.java}}. Here are the results for 5M docs:
|round|indexing, mins|search req/sec|ram total, GB |index size, GB|
| EdgeNGram |85|27.82|2.3|23|
|derived edges|51|7.22|5.5|9.1|

We gain in index size and even in indexing time, at the cost of some RAM, as expected. The EdgeNGram cache can be made a little more compact. The trick is to append something to each edge n-gram to make it unique. The interesting thing is that search is more than 3 times slower. I suppose the posting offsets obtained during term expansion could be sorted before reading the postings.
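The closing idea, sorting the posting offsets collected during term expansion so that the postings file is read front to back, might look like the sketch below. It is only an illustration under assumptions: `readOrder` and `seekDistance` are hypothetical helpers, the offsets are made-up values, and de-duplicating repeated offsets is a choice made here rather than something stated in the comment.

```java
import java.util.Arrays;

final class SortedPostingReads {
    /** Sorted, de-duplicated copy of the offsets, so reads only move forward through the file. */
    static long[] readOrder(long[] expandedOffsets) {
        return Arrays.stream(expandedOffsets).sorted().distinct().toArray();
    }

    /** Total absolute seek distance when visiting the offsets in the given order, starting at 0. */
    static long seekDistance(long[] offsets) {
        long position = 0;
        long total = 0;
        for (long offset : offsets) {
            total += Math.abs(offset - position);
            position = offset;
        }
        return total;
    }

    public static void main(String[] args) {
        // Offsets in the order term expansion happened to produce them (hypothetical values).
        long[] expanded = {9_000L, 120L, 4_500L, 120L, 7_300L};
        long[] ordered = readOrder(expanded);
        System.out.println(Arrays.toString(ordered));
        System.out.println("unsorted seek distance: " + seekDistance(expanded));
        System.out.println("sorted seek distance:   " + seekDistance(ordered));
    }
}
```

Seek distance is only a rough proxy for I/O cost, but it shows why ordered reads should help when the index is not fully cached.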
[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Attachment: benchmark-1m.out

I've run the benchmark for 1M wiki docs: [^benchmark-1m.out]. It turns out that memory consumption for derivative terms (or for the current implementation at least) is terrible, so I couldn't even run a 4M benchmark on a 16G laptop. Therefore, using BytesRefHash is absolutely necessary (the current code is pretty dumb). Also, I've realized that terms are derived again on every merge, but I have no idea how to avoid it. Here is the comparison on 1M wiki docs with url terms excluded.
|round|indexing, mins|search req/sec|ram total, GB |index size, GB|
| EdgeNGram |25|81.04|1.9|6.3|
|derived edges|18|35.31|10.2|2.0|

So far, search results don't match side by side, but I'm not sure whether they are expected to match in the benchmark. A good random test is necessary (fwiw, the existing test actually tests nothing).
[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Attachment: LUCENE-7863.patch

[^LUCENE-7863.patch] properly instantiates the offset buffer.
[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Attachment: LUCENE-7863.patch

[^LUCENE-7863.patch] is much friendlier for review.
[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Attachment: LUCENE-7863.patch

OK, [^LUCENE-7863.patch] should pass the benchmark. The thing is, there should be a single posting format, since different formats write into different suffixes. The idea of writing negative offsets (ZLong) conditionally turned out to be a dead end. Please hold off on reviews until I change the format to write ZLong for all fields' metadata.
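Why negative offset deltas call for ZLong rather than VLong can be seen in a small stand-alone sketch. The zig-zag mapping below is the standard one (the same idea as Lucene's `BitUtil.zigZagEncode`). `ZigZagSketch` and `varLengthBytes` are helpers written only for this illustration, and the delta value is made up.

```java
final class ZigZagSketch {
    /** Maps signed values to unsigned ones: 0, -1, 1, -2, ... become 0, 1, 2, 3, ... */
    static long zigZagEncode(long value) {
        return (value >> 63) ^ (value << 1);
    }

    /** Inverse of zigZagEncode. */
    static long zigZagDecode(long encoded) {
        return (encoded >>> 1) ^ -(encoded & 1);
    }

    /** Bytes needed by a 7-bits-per-byte variable-length encoding of the raw 64-bit pattern. */
    static int varLengthBytes(long bits) {
        int bytes = 1;
        while ((bits & ~0x7FL) != 0) {
            bits >>>= 7;
            bytes++;
        }
        return bytes;
    }

    public static void main(String[] args) {
        long delta = -42L; // hypothetical negative offset delta pointing back to earlier postings
        long encoded = zigZagEncode(delta);
        System.out.println("raw bits need " + varLengthBytes(delta) + " bytes");
        System.out.println("zig-zag value " + encoded + " needs " + varLengthBytes(encoded) + " byte(s)");
        System.out.println("round trip: " + zigZagDecode(encoded));
    }
}
```

A small negative delta has all its high bits set, so a plain variable-length encoding of the raw bits takes the maximum number of bytes. After zig-zag mapping it fits in one byte. Positive deltas lose one bit of range, which is part of why mixing the two encodings conditionally is awkward.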
[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Attachment: LUCENE-7863.patch

[^LUCENE-7863.patch] has significant fixes for codec registration.
- It looks like a large enough term dictionary hits some code path in {{IntersectingTermsEnum}} that is broken by the introduced index format changes.
- It is reproduced with {{derivative-terms-only.alg}}:
{code}
java.io.EOFException: seek past EOF: MMapIndexInput(path="...lucene-solr/lucene/benchmark/deriv/index/_0_Lucene50HijackInjector_0.doc")
	at org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.seek(ByteBufferIndexInput.java:366)
	at org.apache.lucene.codecs.lucene50.Lucene50PostingsReader$BlockDocsEnum.reset(Lucene50PostingsReader.java:306)
	at org.apache.lucene.codecs.lucene50.Lucene50PostingsReader.postings(Lucene50PostingsReader.java:210)
	at org.apache.lucene.codecs.blocktree.SegmentTermsEnum.postings(SegmentTermsEnum.java:1006)
	at org.apache.lucene.search.MultiTermQueryConstantScoreWrapper$1.rewrite(MultiTermQueryConstantScoreWrapper.java:166)
{code}
- Overall, the idea of just changing VLong to ZLong through overriding turned out not to be a good one: it leads to many changes that remove encapsulation and {{final}} modifiers, which defeats their purpose.
[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Attachment: (was: LUCENE-7863.patch)
[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Attachment: (was: LUCENE-7863.patch)
[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Attachment: LUCENE-7863.patch
[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Attachment: LUCENE-7863.patch

[^LUCENE-7863.patch] benchmark fixes.
[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Attachment: LUCENE-7863.patch

One severe fix in [^LUCENE-7863.patch]; the benchmark is in progress.
[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Attachment: LUCENE-7863.patch

[^LUCENE-7863.patch] Move test {{benchmark}} to resolve dependency. Started to work on the benchmark, WIP.
[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Attachment: LUCENE-7863.patch

infix proof case
[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Attachment: LUCENE-7863.patch

WIP [^LUCENE-7863.patch]. It introduces a codec with two posting formats:
# a hijacking PF, which stores posting offsets for the original terms
# an injecting PF, which reverses terms and supplies the offset to the original terms' postings (this is the only place where the file format changes: it is written with ZLong, since these offset deltas are negative)

It has to break into many private and final members, which blows up the patch.
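The relationship between the two posting formats can be pictured with an in-memory toy model. This is a sketch under assumptions, not the codec from the patch: `SharedPostingsSketch` and its methods are hypothetical, a list index stands in for a file offset, and two maps stand in for the term dictionaries of the hijacking and injecting sides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SharedPostingsSketch {
    // Each postings list is stored exactly once; the list index plays the role of a file offset.
    private final List<int[]> postingsFile = new ArrayList<>();
    // "Hijacking" side: original term -> offset of its postings.
    private final Map<String, Integer> originalTerms = new HashMap<>();
    // "Injecting" side: reversed term -> offset of the ORIGINAL term's postings.
    private final Map<String, Integer> reversedTerms = new HashMap<>();

    void index(String term, int... docIds) {
        int offset = postingsFile.size();
        postingsFile.add(docIds);
        originalTerms.put(term, offset);
        reversedTerms.put(new StringBuilder(term).reverse().toString(), offset);
    }

    int[] postings(String term) {
        return lookup(originalTerms, term);
    }

    int[] postingsForReversed(String reversedTerm) {
        return lookup(reversedTerms, reversedTerm);
    }

    int storedPostingsLists() {
        return postingsFile.size();
    }

    private int[] lookup(Map<String, Integer> dictionary, String key) {
        Integer offset = dictionary.get(key);
        return offset == null ? new int[0] : postingsFile.get(offset);
    }

    public static void main(String[] args) {
        SharedPostingsSketch index = new SharedPostingsSketch();
        index.index("lucene", 1, 5, 9);
        // Both dictionaries resolve to the same stored list, so the postings are not repeated.
        System.out.println(index.postings("lucene") == index.postingsForReversed("enecul"));
        System.out.println(index.storedPostingsLists());
    }
}
```

In this model the reversed term costs one extra dictionary entry and no extra postings. That is the saving the issue is after.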
[jira] [Updated] (LUCENE-7863) Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc
[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated LUCENE-7863:
    Summary: Don't repeat postings (and perhaps positions) on ReverseWF, EdgeNGram, etc  (was: Don't repeat postings and positions on ReverseWF, EdgeNGram, etc)