Re: ArithmeticException: due to integer overflow during lucene merging

2024-05-15 Thread Michael McCandless
Thanks Jerven, more response inlined below: On Tue, May 14, 2024 at 12:58 PM Jerven Tjalling Bolleman wrote: The index that had an issue when merging into one segment definitely had > more than 1 billion times the word "positional" in it. I hope to be able > to give a closer number once

Re: ArithmeticException: due to integer overflow during lucene merging

2024-05-14 Thread Michael McCandless
I think we should at least open an issue to try to improve the exception message? We might catch the exception higher up (where we know the field name) and rethrow with the field name, maybe. We can discuss options on the issue ... If you are not using custom term frequencies it's not clear to

Re: recommended index size

2024-01-04 Thread Michael McCandless
Hi Vincent, Lucene has a hard limit of ~2.1 B documents in a single index; hopefully you hit the ~50 - 100 GB limit well before that. Otherwise it's very application dependent: how much latency can you tolerate during searching, how fast are the underlying IO devices at random and large

Re: Performance changes within the Lucene 8 branch

2023-12-14 Thread Michael McCandless
Hi Marc, How are you retrieving your hits? Lucene's stored fields, or doc values, or both? Do you sort the hits docids and then retrieve them in docid order (NOT in the sorted order Lucene returned them in)? I think that might be faster as Lucene's stored fields use block compression and if
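The retrieval order suggested above can be sketched in plain Java. This is a minimal illustration, not Lucene code: the `fetchInDocIdOrder` helper and the `loader` callback are hypothetical stand-ins for whatever the application uses to load a stored document by docid. The idea is to visit docids in ascending order (so neighbouring documents in the same compressed block are read together) while still returning results in the ranked order the search produced.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntFunction;

class DocIdOrderedFetch {

    /**
     * Loads documents in ascending docid order, but returns them in the
     * original rank order. The loader is a hypothetical callback that stands
     * in for the application's stored-fields lookup.
     */
    static List<String> fetchInDocIdOrder(int[] docIdsInRankOrder, IntFunction<String> loader) {
        int n = docIdsInRankOrder.length;

        // order[k] is the rank of the hit with the k-th smallest docid.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingInt(i -> docIdsInRankOrder[i]));

        // Visit hits in docid order, storing each result at its rank.
        String[] loaded = new String[n];
        for (int rank : order) {
            loaded[rank] = loader.apply(docIdsInRankOrder[rank]);
        }
        return Arrays.asList(loaded);
    }

    public static void main(String[] args) {
        int[] ranked = {42, 7, 19}; // docids in the order the search returned them
        List<String> docs = fetchInDocIdOrder(ranked, id -> {
            System.out.println("loading docid " + id);
            return "doc" + id;
        });
        System.out.println(docs);
    }
}
```

Whether this helps in practice depends on how many hits are fetched and how they are spread across compressed blocks, so it is worth measuring before and after.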

Re: Consistent NRT searching with SearcherLifetimeManager and multiple instances

2023-12-14 Thread Michael McCandless
Hi Steven, Great question! I'm so glad to hear your app is providing consistent pagination :) I've long felt Lucene (with NRT segment replication) could do a great job at this, yet so few apps manage to implement it. Every time I interact with a search engine and go to the next page it irks me

Re: When to use StringField and when to use FacetField for categorization?

2023-10-20 Thread Michael McCandless
e has to > "connect" it with a TaxonomyWriter > > FacetsConfig config = new FacetsConfig(); > DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir); > indexWriter.addDocument(config.build(taxoWriter, doc)); > > right? > > Thanks > > Michael

Re: When to use StringField and when to use FacetField for categorization?

2023-10-20 Thread Michael McCandless
There are some differences. StringField is indexed into the inverted index (postings) so you can do efficient filtering. You can also store in stored fields to retrieve. FacetField does everything StringField does (filtering, storing (maybe?)), but in addition it stores data for faceting. I.e.

Re: Lucene Index Writer in a distributed system

2023-10-19 Thread Michael McCandless
Hi Gopal, Indeed, for a single Lucene index, only one writer may be open at a time. Lucene tries to catch you if you mess this up, using file-based locking. If you really need concurrent indexing, you could have N IndexWriters each writing into a private Directory, and then periodically use

Re: Reindexing leaving behind 0 live doc segments

2023-08-31 Thread Michael McCandless
Hi Rahul, Please do not pursue Approach 2 :) ReadersAndUpdates.release is not something the application should be calling. This path can only lead to pain. It sounds to me like something in Solr is holding an old reader (maybe the last commit point, or reader prior to the refresh after you

Re: Vector Search with OpenAI Embeddings: Lucene Is All You Need

2023-08-31 Thread Michael McCandless
Thanks Michael, very interesting! I of course agree that Lucene is all you need, heh ;) Jimmy Lin also tweeted about the strength of Lucene's HNSW: https://twitter.com/lintool/status/1681333664431460353?s=20 Mike McCandless http://blog.mikemccandless.com On Thu, Aug 31, 2023 at 3:31 AM

Re: LuceneTestCase altered the default query cache policy

2023-06-27 Thread Michael McCandless
Hi Yuan, [Disclaimer: I work in the same team at Amazon, customer facing product search, where we heavily use Lucene at high scale!] LuceneTestCase already has similar assertions, e.g. to confirm that no system properties were changed, no threads leaked, not too many static objects left

Re: Lucene in action

2023-06-10 Thread Michael McCandless
Hi Vimal, Indeed I think it is unlikely I have the energy for a 3rd edition ... but anyone can drive the 3rd edition, not just the prior authors. New authors welcome! > Since 2nd edition ( based on lucene 4), I'm sorry to say that 2nd edition is based on Lucene 3.0 not 4! It's even older than

Re: Analyzer.createComponents(String fieldname) only being called once, when indexing multiple documents

2023-06-09 Thread Michael McCandless
Hi Usman, Long ago Lucene switched to reusing these analysis components (per Analyzer, per thread), so that explains why createComponents is called once. However, the reuse policy is controllable (expert usage), so in theory you could implement an Analyzer.ReuseStrategy that never reuses and

Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-09 Thread Michael McCandless
I'd also love to understand this: > using SimpleFSDirectoryFactory (since Mmap doesn't quite work well on Windows for our index sizes which commonly run north of 1 TB) Is this a known problem on certain versions of Windows? Normally memory mapped IO can scale to very large sizes (well beyond

Re: Info required on licensing of Lucene component

2023-05-18 Thread Michael McCandless
> > > > We do see the fix included in Lucene 9.6.0. > > Appreciate your prompt response and thank you so much for resolving the > issue! > > > > Regards, > > Open Source Request Team > > > > *From:* Michael McCandless > *Sent:* 11 May

Re: Question - Why stopwords.txt provided by smartcn contains blank lines?

2023-05-15 Thread Michael McCandless
Hi Jerry, I agree, that makes no sense! Maybe the stopword loader should ignore truly blank lines? Also, the comments on lines 57 and 59 are confusing -- there are no (default) English and Chinese stopwords in the file. I guess they are placeholders. Could you open an issue in Lucene's GitHub

Re: Info required on licensing of Lucene component

2023-05-11 Thread Michael McCandless
cking-your-work-with-issues/linking-a-pull-request-to-an-issue I'll start a separate thread ... Mike McCandless http://blog.mikemccandless.com On Wed, May 10, 2023 at 12:28 PM Michael McCandless < luc...@mikemccandless.com> wrote: > Hello, > > That's a great question, and, look

Re: Info required on licensing of Lucene component

2023-05-10 Thread Michael McCandless
om/artifact/org.apache.lucene/lucene-backward-codecs/9.3.0 > > > > Hence, just wanted to confirm exactly which Lucene release is the > update/pull request applied to? > > > > > > Thanks, > > Open Source Request Team > > > > > > *From:* Michael Mc

Re: Info required on licensing of Lucene component

2023-04-06 Thread Michael McCandless
> In that case, can you’ll update your source repo for Lucene to exclude references to ‘junit’ from Notices.txt file since it is something which is not part of distribution for Lucene. That sounds reasonable to me. I'll open an issue in our GitHub repo, but IANAL and I'm not sure how to

Re: Info required on licensing of Lucene component

2023-04-04 Thread Michael McCandless
Hello, You maybe missed the two responses already to the email, since by default responses only go to the user list, not back to the individual. See the archived responses here: https://lists.apache.org/thread/zg01tkq8wtmym27q3dolcg1msbtoxoxl Mike McCandless http://blog.mikemccandless.com On

Re: Vector Search on Lucene

2023-03-16 Thread Michael McCandless
Note that Lucene's demo package (IndexFiles.java, SearchFiles.java) also shows examples of how to index and search KNN vectors. Mike McCandless http://blog.mikemccandless.com On Thu, Mar 2, 2023 at 4:46 AM Michael Wechner wrote: > Hi Marcos > > The indexing looks kind of > > Document doc =new

Re: [ANNOUNCE] Issue migration Jira to GitHub starts on Monday, August 22

2022-08-25 Thread Michael McCandless
"fix-version"), please review this >> manual. >> > >> https://github.com/apache/lucene/blob/main/dev-docs/github-issues-howto.md >> > >> > Tomoko >> > >> > >> > On Mon, Aug 22, 2022 at 19:46 Michael McCandless : >> >> >>

Re: [ANNOUNCE] Issue migration Jira to GitHub starts on Monday, August 22

2022-08-22 Thread Michael McCandless
Wooot! Thank you so much Tomoko!! Mike On Mon, Aug 22, 2022 at 6:44 AM Tomoko Uchida wrote: > > > Issue migration has been started. Jira is now read-only. > > GitHub issue is available for new issues. > > - You should open new issues on GitHub. E.g. >

Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-08-06 Thread Michael McCandless
OK done: https://github.com/apache/lucene-jira-archive/commit/13fa4cb46a1a6d609448240e4f66c263da8b3fd1 Mike McCandless http://blog.mikemccandless.com On Sat, Aug 6, 2022 at 10:29 AM Baris Kazar wrote: > I think so. > Best regards > -- > *From:* Michae

Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-08-06 Thread Michael McCandless
Thanks Baris, And your Jira ID is bkazar right? Mike On Sat, Aug 6, 2022 at 10:05 AM Baris Kazar wrote: > My github username is bmkazar > can You please register me? > Best regards > ____ > From: Michael McCandless > Sent: Saturday, August

Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-08-06 Thread Michael McCandless
the linked accounts coming! Mike On Thu, Aug 4, 2022 at 7:02 PM Rushabh Shah wrote: > Hi, > My mapping is: > JiraName,GitHubAccount,JiraDispName > shahrs87, shahrs87, Rushabh Shah > > Thank you Tomoko and Mike for all of your hard work. > > > > > On Sun, Jul 31, 2022 at 3

Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-07-31 Thread Michael McCandless
d: wjp719 > > the jira issue I create before: > https://issues.apache.org/jira/browse/LUCENE-10425 > the github pr I submit before: https://github.com/apache/lucene/pull/780 > > > Best Regards, > jianping weng > > > > Michael McCandless on Sun, Jul 31, 2022 1

[HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-07-31 Thread Michael McCandless
Hello Lucene users, contributors and developers, If you have used Lucene's Jira and you have a GitHub account as well, please check whether your user id mapping is in this file: https://github.com/apache/lucene-jira-archive/blob/main/migration/mappings-data/account-map.csv.20220722.verified If

Re: Unclear on what position means

2022-07-22 Thread Michael McCandless
Hi Kendall, "Position" and "Offset" are often confused in Lucene ;) Lucene uses offset to track what you referred to ("(character, not byte) offset into a text file", or into an indexed string). Lucene uses position to track the Nth token: position 0 is first token, position 1 is the second
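The distinction between position and offset described above can be illustrated with a small stand-alone sketch. It does not use Lucene's analysis classes: the `Token` type and `tokenize` method are hypothetical, the tokenizer simply splits on whitespace, every token advances the position by one, and the end offset is taken to be exclusive (one past the token's last character).

```java
import java.util.ArrayList;
import java.util.List;

class PositionVsOffset {

    /** Hypothetical token record: text, token position, and character offsets. */
    static final class Token {
        final String text;
        final int position;    // Nth token in the field, starting at 0
        final int startOffset; // character index where the token starts
        final int endOffset;   // character index one past the token's last character

        Token(String text, int position, int startOffset, int endOffset) {
            this.text = text;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return text + " position=" + position
                + " offsets=[" + startOffset + "," + endOffset + ")";
        }
    }

    /** Splits on whitespace, tracking both the position and the character offsets. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            // Skip any run of whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token's characters.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(new Token(text.substring(start, i), position++, start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two spaces before "fox": offsets shift, positions do not.
        for (Token t : tokenize("the quick  fox")) {
            System.out.println(t);
        }
    }
}
```

Extra whitespace changes the offsets of later tokens but leaves their positions untouched, which is why phrase matching works on positions while highlighting works on offsets.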

Re: Replicator PrimaryNode waits forever for remotes to close

2022-06-30 Thread Michael McCandless
+1 to provide a timeout, or, to simply fix close to aggressively close regardless of what the replicas are doing? It's not a great design for primary to be so dependent on the replicas (but vice versa makes sense?). Maybe open a Jira issue or start a PR so we can discuss? Thanks for uncovering

Re: Index corruption and repair

2022-05-05 Thread Michael McCandless
Antony, do you maybe have Microsoft Defender turned on, which might quarantine files that it suspects are malicious? I'm not sure if it is on by default these days on modern Windows boxes ... Mike McCandless http://blog.mikemccandless.com On Thu, May 5, 2022 at 10:34 AM Michael McCandless

Re: Index corruption and repair

2022-05-05 Thread Michael McCandless
On Thu, May 5, 2022 at 10:30 AM Uwe Schindler wrote: To find all errors in an index, you should pass -ea to the java command > line to enable assertions. > +1 Tempting to make CheckIndex demand that :) Or at least, slow you down and make it clear why, if assertions are disabled. Mike

Re: Index corruption and repair

2022-05-05 Thread Michael McCandless
y update? > > Regards, > Antony > > On Sun, 1 May 2022 at 19:35, Antony Joseph > wrote: > >> Hi Michael, >> >> Thank you for your reply. Please find responses to your questions below. >> >> Regards, >> Antony >> >> On Sat, 30 Apr 20

Re: Index corruption and repair

2022-04-30 Thread Michael McCandless
Hi Antony, Hmm it looks like the root cause is this: Caused by: java.nio.file.NoSuchFileException: D:\i\202204\_14gb.si Can you list all the files in the index directory at the time this exception happens, and reply here? We need to figure out whether the file is really missing or what.

Re: A question on PhraseQuery and slop

2021-12-13 Thread Michael McCandless
Hello Claude, Hmm, that is interesting that you see slop=2 matching query "quick fox" against document "the fox is quick". Edit distance (Levenshtein) is a bit tricky because it might include a transposition (just swapping the two words) as edit distance 1 OR 2. So maybe Lucene's PhraseQuery is

Re: Java 17 and Lucene

2021-10-20 Thread Michael McCandless
UsageEstimator class. > > > > We suppressed the warning for now (based on recommendations > > > > <http://mail-archives.apache.org/mod_mbox/db-derby- > > > > dev/202106.mbox/%3CJIRA.13369440.1617476525000.615331.16239514800 > > > > 5...@atlassian.jira%3E>

Re: Java 17 and Lucene

2021-10-18 Thread Michael McCandless
Also, I try to semi-aggressively upgrade Lucene's nightly benchmarks to new JDK releases and leave an annotation on the nightly charts: https://home.apache.org/~mikemccand/lucenebench/ I just now upgraded to JDK 17 and kicked off a new benchmark run ... in a few hours it should show the new data

Re: Issue regarding build

2021-08-19 Thread Michael McCandless
Hello Udit, The screen shot did not come through for me -- it's a broken image. Maybe copy/paste the text of the error instead? Also, try running "./gradlew assemble" from the command-line (in a console shell, e.g. Terminal on OS X) instead? Mike McCandless http://blog.mikemccandless.com On

Re: Info about the Lucene 4.10.4 version.

2021-06-22 Thread Michael McCandless
Hi Arvind, I responded about this on the issue you also opened: https://issues.apache.org/jira/browse/LUCENE-10013 Mike McCandless http://blog.mikemccandless.com On Tue, Jun 22, 2021 at 10:04 AM Arvind Kumar Sahu wrote: > Hi Team, > > Currently we are using Lucene 4.10.4 version. We are

Re: Multiple merge-runs from same set of segments

2021-05-24 Thread Michael McCandless
Are you trying to rewrite your already created index into a different segment geometry? Maybe have a look at the new IndexRearranger tool? It is already doing something like what you enumerated below, including mocking LiveDocs to get the right

Re: Performance decrease with NRT use-case in 8.8.x (coming from 8.3.0)

2021-05-19 Thread Michael McCandless
> The update showed no issues (e.g. compiled without changes) but I noticed that our test-suites take a lot longer to finish. Hmm, that sounds bad. We need our tests to stay fast but also do a good job testing things ;) Does your production usage also slow down? Tests do other interesting

Re: Correct usage of synonyms with Japanese

2021-05-18 Thread Michael McCandless
Hi Geoffrey, [Disclaimer: Geoffrey and I both work at Amazon on customer-facing product search] We absolutely must get SynonymGraphFilter consuming input graphs! This is just a (serious) bug in it! But it's just software, let's fix it :) That is clearly the right fix, it is just rather fun

Re: CorruptIndexException after failed segment merge caused by No space left on device

2021-03-24 Thread Michael McCandless
+1, this sounds like a bad bug in Lucene! We try hard to test for and prevent such bugs! As long as you succeeded in at least one commit since creating the index before you hit the disk full, restarting Lucene on the index should have recovered from that last successful commit. How often do you

Re: [VOTE] Lucene logo contest, third time's a charm

2020-12-21 Thread Michael McCandless
Thank you Ryan for pushing forwards to our new logo. Now that this VOTE has passed, are there issues open to actually "deliver it" to the world? E.g. I see https://lucene.apache.org still shows our old logo. Branding is a lot of work! Mike McCandless http://blog.mikemccandless.com On Tue,

Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2020-12-14 Thread Michael McCandless
Hello, Yes, that is exactly what MMapDirectory.setPreload is trying to do, but no promises (it is best effort). I think it asks the OS to touch all pages in the mapped region so they are cached in RAM, if you have enough RAM. Make your JVM heap as low as possible to let the OS have more RAM to

Re: best way (performance wise) to search for field without value?

2020-11-13 Thread Michael McCandless
WED_EMPTY", BooleanClause.Occur.SHOULD); > > On Fri, Nov 13, 2020 at 2:09 PM Michael McCandless < > luc...@mikemccandless.com> wrote: > > > Maybe NormsFieldExistsQuery as a MUST_NOT clause? Though, you must > enable > > norms on your field to use that. >

Re: best way (performance wise) to search for field without value?

2020-11-13 Thread Michael McCandless
Maybe NormsFieldExistsQuery as a MUST_NOT clause? Though, you must enable norms on your field to use that. TermRangeQuery is indeed a horribly costly way to execute this, but if you cache the result on each refresh, perhaps it is OK? You could also index a dedicated doc values field indicating

Re: BooleanQuery normal form

2020-09-27 Thread Michael McCandless
Hi Patrick, I don't think Lucene supports CNF or DNF for BooleanQuery? BooleanQuery will try to do some rewriting simplifications for degenerate cases, e.g. a BooleanQuery with a single clause. Probably it could do more optimizing? It is quite complex already :) Mike On Tue, Sep 22, 2020 at

Re: Optimizing term-occurrence counting (code included)

2020-09-21 Thread Michael McCandless
I left a comment on the issue. Mike McCandless http://blog.mikemccandless.com On Sun, Sep 20, 2020 at 1:08 PM Alex K wrote: > Hi all, I'm still a bit stuck on this particular issue.I posted an issue on > the Elastiknn repo outlining some measurements and thoughts on potential > solutions:

Re: [VOTE] Lucene logo contest, third time's a charm

2020-09-02 Thread Michael McCandless
A2, A1, C5, D (binding) Thank you to everyone for working so hard to make such cool looking possible future Lucene logos! And to Ryan for the challenging job of calling this VOTE :) Mike McCandless http://blog.mikemccandless.com On Tue, Sep 1, 2020 at 4:21 PM Ryan Ernst wrote: > Dear

Re: Hierarchical facet select a subtree but one child

2020-08-17 Thread Michael McCandless
I think this is a missing API in DrillDownQuery? Nicola, could you open an issue? The filtering is as Mike Sokolov described, but I think we should add a sugar method, e.g. DrillDownQuery.remove or something, to add a negated query clause. And until this API is added and you can upgrade to it,

Re: Adding fields with same field type complains that they have different term vector settings

2020-06-30 Thread Michael McCandless
truct the values for the field being modified, or > am I likely to just run into more issues by modifying a loaded Document? > > Regards, > Albert > > > From: "Michael McCandless" > > To: "java-user" , "albert macsweeny" > > >

Re: Adding fields with same field type complains that they have different term vector settings

2020-06-29 Thread Michael McCandless
Hi Albert, Unfortunately, you have fallen into a common and sneaky Lucene trap. The problem happens because you loaded a Document from the index's stored fields (the one you previously indexed) and then tried to modify that one and re-index. Lucene does not guarantee that this will work,

Re: Sharing buffer between large number of IndexWriters?

2020-06-22 Thread Michael McCandless
Hello Marcin, Alas, Lucene does not have this capability out of the box. However, you are able to live-update the IndexWriterConfig.setRAMBufferSizeMB, and the change should take effect on the next document indexed in that IndexWriter instance. So you could build your own "proportional RAM" on

Re: [VOTE] Lucene logo contest

2020-06-17 Thread Michael McCandless
Change is good :) I vote Option A (binding PMC vote). Thank you to all the open-source artists who helped out here. Mike McCandless http://blog.mikemccandless.com On Mon, Jun 15, 2020 at 6:08 PM Ryan Ernst wrote: > Dear Lucene and Solr developers! > > In February a contest was started to

Re: CheckIndex complaining about -1 for norms value

2020-06-11 Thread Michael McCandless
Maybe we should fix CheckIndex to print norms as unsigned integers? Mike McCandless http://blog.mikemccandless.com On Thu, Jun 11, 2020 at 3:00 AM Adrien Grand wrote: > To my knowledge, -1 always represented the maximum supported length, both > before and after 7.0 (when we changed the norms

Re: Lucene Migration issue

2020-06-08 Thread Michael McCandless
You're welcome! Mike McCandless http://blog.mikemccandless.com On Mon, Jun 8, 2020 at 10:48 AM Adarsh Sunilkumar < adarshsunilkuma...@gmail.com> wrote: > Hi Michael, > > Thanks for your information. > > > Thanks, > Adarsh Sunilkumar > > On Mon, Jun 8, 2020,

Re: Lucene Migration issue

2020-06-08 Thread Michael McCandless
jira/browse/LUCENE-8134 > <https://issues.apache.org/jira/browse/LUCENE-8134> > > Thanks& Regards, > Adarsh Sunilkumar > > On Fri, Jun 5, 2020 at 7:28 PM Michael McCandless < > luc...@mikemccandless.com> wrote: > >> This just means you previo

Re: Lucene Migration issue

2020-06-05 Thread Michael McCandless
This just means you previously indexed only docids (skipping term frequencies, positions) for at least one of the fields in at least one document in your existing index. But now you are trying to also index with term frequencies and positions, which Lucene cannot do. You either have to reindex

Re: Retrieving query-time join fromQuery hits

2020-06-03 Thread Michael McCandless
left side of the join must retain some state, to know which top hits corresponded to those join values, and then add an API to retrieve them? Mike McCandless http://blog.mikemccandless.com On Wed, May 20, 2020 at 6:31 PM Michael McCandless < luc...@mikemccandless.com> wrote: >

Re: Retrieving query-time join fromQuery hits

2020-05-20 Thread Michael McCandless
I am trying first to understand the proposed solution from the previous thread. You run query #1, it returns top N hits. From those hits you ask JoinUtil to create the "joined" query #2. You run the query #2 to get the top final (joined) hits. Then, to reconstruct which docids from query #1

Re: Resizable LRUQueryCache

2020-03-10 Thread Michael McCandless
Maybe start with your own cache implementation that implements a resize method? The cache is pluggable through IndexSearcher. Fully discarding the cache and swapping in a newly sized (empty) one could also be jarring, so I can see the motivation for this method. Especially for usages that are

Re: Lucene 7.7.2 Indexwriter.numDocs() replacement in Lucene 8.4.1

2020-02-26 Thread Michael McCandless
Yes. Mike McCandless http://blog.mikemccandless.com On Mon, Feb 24, 2020 at 5:55 PM wrote: > A typo corrected below. > > Best regards > > > On 2/24/20 5:54 PM, baris.ka...@oracle.com wrote: > > Hi,- > > > > I hope everyone is doing great. > > > > > > I think the Lucene 7.7.2

Re: Searching number of tokens in text field

2020-01-02 Thread Michael McCandless
Norms encode the number of tokens in the field, but in a lossy manner (1 byte by default), so you could probably create a custom query that filtered based on that, if you could tolerate the loss in precision? Or maybe change your norms storage to more precision? You could use

Re: Lucene Index Cloud Replication

2019-07-09 Thread Michael McCandless
+1 to share code for doing 1) and 3) both of which are tricky! Safely moving / copying bytes around is a notoriously difficult problem ... but Lucene's "end to end checksums" and per-segment-file-GUID make this safer. I think Lucene's replicator module is a good place for this? Mike McCandless

Re: find documents with big stored fields

2019-07-01 Thread Michael McCandless
Hi Rob, The codec records per docid how many bytes each document consumes -- maybe instrument the codec's sources locally, then open your index and have it visit stored fields for every doc in the index and gather stats? Or, to avoid touching Lucene level code, you could make a small tool that

Re: ArrayIndexOutOfBoundsException during System.arraycopy in BKDWriter

2019-05-03 Thread Michael McCandless
Note that the -Xint flag will make your code run tremendously more slowly! Likely to the point of not really being usable. But it'd be interesting to see if that side-steps the bug. Is it possible to test with OpenJDK as well? The BKDWriter code is quite complex, so it is also possible there

Re: Ask about Lucene/Core/Index DocumentsWriter

2019-03-19 Thread Michael McCandless
Can you try increasing your IndexWriter.setRAMBufferSizeMB? That flush control logic will block incoming threads if the number of bytes trying to flush to disk is too large relative to your RAM buffer. Mike McCandless http://blog.mikemccandless.com On Mon, Mar 18, 2019 at 2:30 PM yuncheng lu

Re: FlattenGraphFilter assertion error

2019-03-12 Thread Michael McCandless
.java:195) > at > org.apache.lucene.analysis.core.FlattenGraphFilter.incrementToken(FlattenGraphFilter.java:258) > at com.wolfram.textsearch.AnalyzerError.main(AnalyzerError.java:32) > > It's the interaction between WordDelimiterGraphFilter and stop word > removal, it seems, that t

Re: IndexWriter concurrent flushing

2019-02-17 Thread Michael McCandless
+1 to make it simple to let multiple threads help with commit/refresh operations. IW.yield is a simple way to achieve it, matching (roughly) how IW's commit/refresh work today, hijacking incoming indexing threads to gain concurrency. I think this would be a small change? Adding an

Re: prorated early termination

2019-02-03 Thread Michael McCandless
On Sun, Feb 3, 2019 at 10:41 AM Michael Sokolov wrote: > > In single-threaded mode we can check against minCompetitiveScore and > terminate collection for each segment appropriately, > > > Does Lucene do this today by default? That should be a nice > optimization, > and it'd be safe/correct. >

Re: prorated early termination

2019-02-03 Thread Michael McCandless
I think this is because our per-hit cost is sometimes very high -- we have "post filters" that are sometimes very restrictive. We are working to get those post-filters out into an inverted index to make them more efficient, but net/net reducing how many hits we must collect for each segment can

Re: prorated early termination

2019-02-03 Thread Michael McCandless
One question about this: > In single-threaded mode we can check against minCompetitiveScore and terminate collection for each segment appropriately, Does Lucene do this today by default? That should be a nice optimization, and it'd be safe/correct. Mike McCandless

Re: RamUsageCrawler

2018-12-06 Thread Michael McCandless
I think you mean RamUsageEstimator (in Lucene's test-framework)? It's entirely possible it fails to dig into Maps correctly with newer Java releases; maybe Dawid or Uwe would know? Mike McCandless http://blog.mikemccandless.com On Tue, Dec 4, 2018 at 12:18 PM Michael Sokolov wrote: > Hi,

Re: Race condition between IndexWriter.commit and IndexWriter.close

2018-12-05 Thread Michael McCandless
in the > documentation does it say that these two calls should be synchronized... > at least that must be fixed. :) > > On 12/1/18 6:25 PM, Michael McCandless wrote: > > I think if you call commit and close concurrently the results are > undefined > > and so this is acceptable.

Re: Race condition between IndexWriter.commit and IndexWriter.close

2018-12-01 Thread Michael McCandless
I think if you call commit and close concurrently the results are undefined and so this is acceptable. Mike On Thu, Nov 29, 2018 at 5:53 AM Boris Petrov wrote: > Hi all, > > We're getting the following exception: > > java.lang.IllegalStateException: cannot close: prepareCommit was already >

Re: SearcherManager not seeing changes in IndexWriter

2018-11-12 Thread Michael McCandless
Thanks for bringing closure, Boris. Mike McCandless http://blog.mikemccandless.com On Mon, Nov 12, 2018 at 7:13 AM Boris Petrov wrote: > Hello, > > OK, so actually this appears to be a bug in our code - Lucene is searching > correctly, we were doing something wrong with the result after

Re: MultiPhraseQuery or PhraseQuery to take the synonyms into account?

2018-09-22 Thread Michael McCandless
PhraseQuery can indeed be used to represent a multi-token synonym. In fact, I mis-spoke before: MultiPhraseQuery can also represent a multi-token synonym when the multiple tokens are all the same except in one spot. Mike McCandless http://blog.mikemccandless.com On Thu, Sep 20, 2018 at 2:32

Re: Question About FST, multiple-column index

2018-09-22 Thread Michael McCandless
You might want to index the name field normally (as StringField, for example), then index the age as a NumericDocValuesField, and then make a BooleanQuery with two required clauses, one clause TermQuery on the name, the other a NumericDocValuesField.newSlowExactQuery. Even though its name is

Re: MultiPhraseQuery

2018-09-18 Thread Michael McCandless
Yes, +1 for a patch to improve the docs! MultiPhraseQuery only works for single term synonyms, and is usually produced by query parsers when the incoming query text had single term synonyms matching, I think? The query parser will use other (span?) queries for multi token synonyms. I think the

Re: SynonymGraphFilter

2018-09-11 Thread Michael McCandless
Try reading the blog post I wrote about token stream graphs? http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html Mike McCandless http://blog.mikemccandless.com On Tue, Sep 11, 2018 at 1:35 PM, wrote: > Any comments please? > > Thanks > > > On 9/10/18 5:07 PM,

Re: SynonymMap.Builder.add method

2018-09-11 Thread Michael McCandless
That's correct. When the input sequence is seen during tokenization, the synonym (graph) filter will also insert the output tokens into the TokenStream, as if they "naturally" occurred. Mike McCandless http://blog.mikemccandless.com On Tue, Sep 11, 2018 at 1:35 PM, wrote: > Any comments

Re: SynonymMap

2018-09-10 Thread Michael McCandless
The SynonymMap.Builder constructor takes a dedup parameter to tell it what to do in that case (when input and output are identical across added rules). Mike McCandless http://blog.mikemccandless.com On Thu, Sep 6, 2018 at 2:06 PM, Baris Kazar wrote: > Hi,- > how does SynonymMap deal with

Re: offsets

2018-07-29 Thread Michael McCandless
How would a fixup API work? We would try to provide correctOffset throughout the full analysis chain? Mike McCandless http://blog.mikemccandless.com On Wed, Jul 25, 2018 at 8:27 AM, Michael Sokolov wrote: > I've run into some difficulties with offsets in some TokenFilters I've been >

Re: Deleted documents and NRT Readers

2018-07-20 Thread Michael McCandless
one > document and then update it, that the load is so small that it for sure > would not have applied the delete. > > Why am I wrong in thinking this? > > > On Thu, Jul 19, 2018, 5:50 PM Michael McCandless < > luc...@mikemccandless.com> wrote: > >> Passing app

Re: Deleted documents and NRT Readers

2018-07-19 Thread Michael McCandless
Passing applyDeletes=false means Lucene does not have to apply all of its buffered deletes. But, it still may have already applied some deletes, so there's no guarantee that it won't have applied deletes. Mike McCandless http://blog.mikemccandless.com On Thu, Jul 19, 2018 at 3:23 PM, Stuart

Re: Lucene Speed

2018-07-18 Thread Michael McCandless
Hi Ehson, Have you looked at the luceneutil source code that runs the benchmarks? https://github.com/mikemccand/luceneutil The sources are not super clean, but that's what's running the nightly benchmarks, starting from src/main/perf/Indexer.java. Mike McCandless http://blog.mikemccandless.com

Re: Recreating index lucene without stopping client applications

2018-07-18 Thread Michael McCandless
If you use IndexWriter.deleteAll, and not any of the other delete by Query, Term methods, it should be quite efficient to delete, as IndexWriter just drops all segments. That API is also transactional, so you could call IW.deleteAll, proceed to reindex all your documents, and if somehow that

Re: UTF8TaxonomyWriterCache inconsistency

2018-07-02 Thread Michael McCandless
Yes please create a Jira issue! Mike On Mon, Jul 2, 2018, 12:31 AM Руслан Торобаев wrote: > Hi! > > I’m facing a problem with taxonomy writer cache inconsistency. At some > point in time UTF8TaxonomyWriterCache starts to return wrong ord for some > facet labels. As result wrong ord are written

Re: Help! - Max Segment name reached

2018-04-21 Thread Michael McCandless
Well I think as time goes on we'll see more and more people running into it ;) But you really need to commit at a surprisingly high rate, and have a surprisingly long lived index, to overflow the int that holds the segment number. E.g. if you commit once per second, it should take ~68 years to
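
A quick back-of-the-envelope check of the ~68 year figure, as a minimal sketch. It assumes each commit consumes exactly one segment number and that a year is 365 days; the class and method names are made up for illustration and are not part of Lucene.

```java
final class SegmentCounterOverflow {
    // Assumption: a plain 365-day year (31,536,000 seconds).
    static final long SECONDS_PER_YEAR = 365L * 24 * 60 * 60;

    /**
     * Years until a counter that starts at 0 and grows by one per commit
     * reaches Integer.MAX_VALUE, at the given steady commit rate.
     */
    static double yearsToOverflow(double commitsPerSecond) {
        double secondsToOverflow = Integer.MAX_VALUE / commitsPerSecond;
        return secondsToOverflow / SECONDS_PER_YEAR;
    }

    public static void main(String[] args) {
        // One commit per second, the rate used in the message above.
        System.out.println("1 commit/sec:   " + yearsToOverflow(1.0) + " years");
        // Ten commits per second shrinks the horizon tenfold.
        System.out.println("10 commits/sec: " + yearsToOverflow(10.0) + " years");
    }
}
```

An index that flushes several segments per commit would use up numbers faster than this estimate suggests.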

Re: WordDelimiterGraphFilter does not respect KeywordAttribute

2018-04-21 Thread Michael McCandless
+1 Mike On Fri, Apr 20, 2018, 9:42 AM Michael Sokolov wrote: > I have a use case that generates some tokens containing punctuation > (fractions and other numerical constructs), but I am handling most > punctuation with WordDelimiterGraphFilter, which then decomposes those >

Re: IndexWriter updateDocument is removing doc from index

2018-03-16 Thread Michael McCandless
Yes you can add documents by calling updateDocument -- if no prior documents matched the deletion Term you provide, nothing is deleted and your new doc is added. Hmm are you sure your 2nd update really updated and then added 12 new docs? Dropping segment 1 makes sense because you deleted the
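
The delete-then-add behavior described above can be illustrated with a toy model. This is not Lucene code: ToyIndex and its methods are hypothetical stand-ins for IndexWriter.updateDocument(Term, doc), with a field/value pair playing the role of the deletion Term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ToyIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();

    /**
     * Deletes every doc whose `field` equals `value`, then adds newDoc.
     * Returns how many docs were deleted; 0 means this was a plain add.
     */
    int updateDocument(String field, String value, Map<String, String> newDoc) {
        int before = docs.size();
        docs.removeIf(d -> value.equals(d.get(field)));
        int deleted = before - docs.size();
        docs.add(newDoc);
        return deleted;
    }

    int numDocs() {
        return docs.size();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        // No prior doc has id=1, so nothing is deleted and the doc is added.
        System.out.println(index.updateDocument("id", "1", Map.of("id", "1", "body", "first")));
        // Now id=1 exists, so the old doc is replaced by the new one.
        System.out.println(index.updateDocument("id", "1", Map.of("id", "1", "body", "second")));
        System.out.println(index.numDocs());
    }
}
```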

Re: any api to get segment number of index

2018-01-14 Thread Michael McCandless
How about IndexSearcher.getIndexReader().leaves().size()? Mike McCandless http://blog.mikemccandless.com On Wed, Jan 10, 2018 at 5:19 AM, Yonghui Zhao wrote: > Hi, > > Is there any public API that I can get segment number of current version > index? > > I didn't find in

Re: typed IntPoint.RangeQuery & LongPoint.rangeQuery

2018-01-09 Thread Michael McCandless
Lucene doesn't (shouldn't?) let you add 'a' at first as an IntPoint and then later as a LongPoint -- they must always be consistent. So however you indexed it, you must use the corresponding class to construct the query. String 'hi' can only be found if you had indexed a token 'hi' in that field

Re: index sorting merge

2017-12-28 Thread Michael McCandless
You should upgrade to newer versions of Lucene, where all segments are sorted, not just merged segments. Mike McCandless http://blog.mikemccandless.com On Thu, Dec 28, 2017 at 11:13 AM, Yonghui Zhao wrote: > Hi, > > I specified a SortingMergePolicy in my case. I find

Re: may be lucene bug

2017-12-28 Thread Michael McCandless
I think there's a bug in your code: this line: doc.doc <= leaf.docBase + leaf.reader().maxDoc()) should be < not <=. Mike McCandless http://blog.mikemccandless.com On Thu, Dec 28, 2017 at 6:15 AM, 291699763 <291699...@qq.com> wrote: > Lucene version:6.6.0 > > when Index >
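
To see why the comparison must be strict, here is a minimal sketch of the leaf lookup over plain arrays. The names are hypothetical; docBase[i] and maxDoc[i] stand in for leaf.docBase and leaf.reader().maxDoc() of the i-th leaf. Each leaf owns the half-open docid range [docBase, docBase + maxDoc), so the first docid of the next leaf equals docBase + maxDoc of the previous one, and `<=` would assign it to the wrong leaf.

```java
final class LeafLookup {
    /**
     * Returns the index of the leaf whose docid range contains globalDoc,
     * or -1 if globalDoc is outside every leaf.
     */
    static int leafIndexFor(int globalDoc, int[] docBase, int[] maxDoc) {
        for (int i = 0; i < docBase.length; i++) {
            // Half-open range: upper bound is exclusive, hence '<' and not '<='.
            if (globalDoc >= docBase[i] && globalDoc < docBase[i] + maxDoc[i]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] docBase = {0, 5, 8};
        int[] maxDoc = {5, 3, 4};
        // Docid 5 is the first doc of the second leaf, not the last of the first.
        System.out.println(leafIndexFor(5, docBase, maxDoc));
    }
}
```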

Re: CompiledAutomaton performance issue

2017-12-17 Thread Michael McCandless
This is just an optimization; maybe we should expose an option to disable it? Or maybe we can find the common suffix on an NFA instead, to avoid determinization? Can you open a Jira issue so we can discuss options? Thanks, Mike McCandless http://blog.mikemccandless.com On Fri, Dec 15, 2017

Re: Optimize FTS memory footprint

2017-12-12 Thread Michael McCandless
Try upgrading Elasticsearch -- 6.0 was released just a few weeks ago -- its (and Lucene's) memory usage has decreased over time. The _uid field in particular will always be costly, unfortunately. Since it's a primary key, every term will be unique, and the term index has to work hard to

Re: Optimize FTS memory footprint

2017-12-12 Thread Michael McCandless
Comments below: On Tue, Nov 28, 2017 at 4:47 PM, elirev wrote: > Thanks Mike. > I did not find any clear way to know if it's FST or Norm, or something > else (unless I missed something). The fact that the FST is an in-memory prefix > index led me to think it is using most

Re: Optimize FTS memory footprint

2017-11-20 Thread Michael McCandless
Are you sure it's FSTs using your heap? Do you have many index fields that have high cardinality? Or many suggesters? Mike McCandless http://blog.mikemccandless.com On Thu, Nov 16, 2017 at 5:03 PM, Eli Revach wrote: > Hi > I am using Elasticsearch 1.7.5, our segment
