RE: Shingle filter that reads the script attribute from ICUTokenizer and LUCENE-2906
Thanks Robert,

Another idea apart from your solution would be to add a tailoring for Tibetan that sets some special attribute indicating 'word-final syllable'. Then this information is not 'lost' and downstream can do the right thing. ... So essentially before doing anything like that, it would be best to know 'the rules of the game' before thinking about any design.

So the ICUTokenizer would have to add that word-final-syllable attribute based on some rules, and then a downstream filter could use the attributes to construct bigrams without creating stupid bigrams. If we end up doing the project, we will be working with people who have expertise in Tibetan and who hopefully will be able to tell us the rules of the game.

Tom
___
Another idea apart from your solution would be to add a tailoring for Tibetan that sets some special attribute indicating 'word-final syllable'. Then this information is not 'lost' and downstream can do the right thing. It's not a difficult thing to do for the tokenizer, but we would need more details: a quick glance at some material on Tibetan punctuation indicates it's not 'this simple': for some syllables the punctuation is sometimes omitted. Honestly I don't know why this is; maybe it means there are some syllables that only appear in word-final position? If so, such important clues should also trigger this attribute. So essentially before doing anything like that, it would be best to know 'the rules of the game' before thinking about any design.
Shingle filter that reads the script attribute from ICUTokenizer and LUCENE-2906
The ICUTokenizer now adds a script attribute to tokens, as do StandardTokenizer and a couple of others (LUCENE-2911); for example, 'Tibetan' or 'Han'. If the Shingle filter had some provision to only make token n-grams when the script attribute matched some specified script, it would solve both the need to produce character bigrams for CJK (Han) and syllable bigrams for Tibetan. We already opened an issue to create overlapping bigrams for CJK (LUCENE-2906). Would it make sense to open an issue for modifying the Shingle filter to have configurable script-specific behavior, or is this just another use case for LUCENE-2906? If it is another use case for LUCENE-2906, then perhaps we need to change the summary of the issue to generalize it beyond CJK. Any suggestions?

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search
RE: Shingle filter that reads the script attribute from ICUTokenizer and LUCENE-2906
Hi Robert,

Thanks for the quick and thoughtful response. I didn't realize these complexities and thought maybe there was an easy solution :) We may be involved in a project that involves Tibetan text, and given our current resources and priorities, we would stick it in the same field as the other 400+ languages. I was hoping that with the script attribute output by the ICUTokenizer, we could figure out something to do script/language-specific processing for Tibetan without adversely affecting anything else.

I suppose to inhibit stupid bigrams you would *not* shingle across shad as well.

Unfortunately, it sounds like the ICUTokenizer will segment on the Tibetan phrase separators but downstream filters won't know that, so we couldn't have a downstream filter that avoided bigramming across a phrase separator. On the other hand, it might be that stupid overlapping bigrams don't hurt retrieval compared to treating syllables as if they were words, i.e. syllable unigrams. (I've not been able to find much published research in English on the issue, and many of the references are to articles in Chinese-language publications. I'm pretty much relying on the article by Hackett and Oard.)

Tom

Hackett, P. G., & Oard, D. W. (2000). Comparison of word-based and syllable-based retrieval for Tibetan (poster session). In Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages - IRAL '00 (pp. 197-198). Hong Kong, China. doi:10.1145/355214.355242 http://dl.acm.org/citation.cfm?doid=355214.355242

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Friday, December 16, 2011 6:45 PM
To: dev@lucene.apache.org
Subject: Re: Shingle filter that reads the script attribute from ICUTokenizer and LUCENE-2906

On Fri, Dec 16, 2011 at 5:44 PM, Burton-West, Tom tburt...@umich.edu wrote: The ICUTokenizer now adds a script attribute to tokens, as do StandardTokenizer and a couple of others (LUCENE-2911); for example "Tibetan" or "Han". If the Shingle filter had some provision to only make token n-grams when the script attribute matched some specified script, it would solve both the need to produce character bigrams for CJK (Han) and syllable bigrams for Tibetan. We already opened an issue to create overlapping bigrams for CJK (LUCENE-2906).

Not sure it totally would, because there are key important differences, and a few complications:

1. CJKTokenizer today creates bigrams in runs of CJK text, where a run is something like [IHK]+ (a run of ideographic, hiragana, katakana). There are different variations on this available too, like only bigramming I+ and doing something else with the katakana (like keeping it as a word). The verdict from previous studies seems to be that there are options here and they both tend to work well. But one thing is still for sure: I think it would be bad here to form bigrams across what was not contiguous text (e.g. across sentence boundaries). Finally, some CJK normalization (such as halfwidth/fullwidth conversion) is not a 1:1 replacement, and so really the process here should at least be aware of this and consider some sequences of half-width kana as a single 'character'.

2. Unlike the CJK case, where you bigram a run, Tibetan separates syllables with special punctuation (tsheg, among other things). That is the reason you get syllables as output from these tokenizers.
So this is already a fundamentally different bigram algorithm, because it's no longer contiguous runs: instead the syllables often have something in between, and what that something is tells you whether it's e.g. a syllable separator or something more like a phrase separator. I suppose to inhibit stupid bigrams you would *not* shingle across shad as well... how do we generalize that? The verdict for this language definitely isn't in yet; I've only seen some very initial rough work on this language, and we aren't totally sure this works well on average.

3. Other complex languages besides these are also emitting syllables at best, too: Thai, Lao, Myanmar, Khmer? Shouldn't we bigram those too? Except one implementation (ICUTokenizer) is emitting syllables here (what type of syllable depends upon the current implementation, too!), and the other (StandardTokenizer) is emitting whole phrases as words. It would be great to bigram the former (we think!), but even more horrible to do it to the latter. I put 'we think' here because there has really been no work done here, so it's just intuition/guessing. And to make matters worse, we have a filter in contrib (ThaiWordFilter) that relies upon the specifics of how StandardTokenizer screws up Thai tokenization so it can 're-tokenize'.

Would it make sense to open an issue for modifying the Shingle filter to have configurable script-specific behavior, or is this just another use case for LUCENE-2906? If it is another use case for LUCENE-2906, then perhaps we need to change the summary of the issue to generalize it beyond CJK.
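For illustration, here is a minimal sketch of the kind of downstream filter discussed in this thread, assuming the ScriptAttribute added in LUCENE-2911. The class name ScriptBigramFilter and the pure-bigram output policy are assumptions for the example, not a proposed Lucene API: it passes non-Tibetan tokens through untouched, emits the first syllable of a Tibetan run as-is, then emits overlapping syllable bigrams; offsets, position increments, and the word-final-syllable logic discussed above are all omitted.

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.icu.tokenattributes.ScriptAttribute;
    import com.ibm.icu.lang.UScript;

    public final class ScriptBigramFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final ScriptAttribute scriptAtt = addAttribute(ScriptAttribute.class);
      private String prev; // previous syllable, if the previous token was Tibetan

      public ScriptBigramFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
          return false;
        }
        if (scriptAtt.getCode() != UScript.TIBETAN) {
          prev = null;  // reset at a script boundary: no bigrams across it
          return true;  // pass non-Tibetan tokens through unchanged
        }
        String cur = termAtt.toString();
        if (prev != null) {
          // overlapping bigram: replace the current syllable with prev+cur
          termAtt.setEmpty().append(prev).append(cur);
        }
        prev = cur;
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        prev = null;
      }
    }

Note that, as the thread points out, this still cannot tell a tsheg from a shad once tokenization has discarded the punctuation, which is exactly where a word-final-syllable attribute would help.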
re: LUCENE-167 and Solr default handling of Boolean operators is broken
The default query parser in Solr does not handle precedence of Boolean operators in the way most people expect: A AND B OR C gets interpreted as A AND (B OR C). There are numerous other examples in the JIRA ticket for LUCENE-167, in this article on the wiki http://wiki.apache.org/lucene-java/BooleanQuerySyntax, and in this blog post: http://robotlibrarian.billdueber.com/solr-and-boolean-operators/

This issue was reported in 2003, but the fix does not seem to have made it into the default query parser for either Lucene or Solr. It appears that LUCENE-167 was closed in 2009 based on the assumption that the query parser in LUCENE-1823 would become the default Lucene query parser. However, LUCENE-1823 seems to have gotten bogged down and is not yet resolved. I do see that there is a precedence query parser in LUCENE-1937 which was committed to contrib in the 3.x branch: (http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/queryparser/src/java/org/apache/lucene/queryParser/precedence/package.html?view=co)

Would it be possible to use the contrib 3.x precedence query parser in Solr? Would this require modifying the LuceneQParserPlugin, and if so would it make sense to open a JIRA issue? Are there any plans to make the precedence query parser the default for either Lucene or Solr? If not, are there any plans to make it more prominent in the documentation that the default Lucene query parser has issues with precedence? A bit more background below.

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

More Background

There were some concerns about breaking backward compatibility, but in a mailing list post in 2005 Yonik Seeley said: "The current behavior is so surprising that I doubt that no one is relying on it." (http://www.mail-archive.com/java-user@lucene.apache.org/msg00018.html) and Doug Cutting said: "+1. Fixing operator precedence seems to me like an acceptable incompatibility. The change needs to be well documented in release notes, and the old QueryParser should be available, deprecated, for a time for back-compatibility." (http://www.mail-archive.com/java-user@lucene.apache.org/msg00037.html)
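A quick way to see the behavior described above is to print what the classic QueryParser produces. A minimal sketch against Lucene 3.x; the field name and analyzer are arbitrary choices for the example:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class PrecedenceDemo {
      public static void main(String[] args) throws Exception {
        QueryParser qp = new QueryParser(Version.LUCENE_34, "text",
            new StandardAnalyzer(Version.LUCENE_34));
        Query q = qp.parse("cat AND dog OR mouse");
        // With the default OR operator this prints something like
        //   +text:cat +text:dog text:mouse
        // i.e. "mouse" becomes a merely-optional clause rather than a real
        // boolean alternative, which is not what most users expect.
        System.out.println(q);
      }
    }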
RE: LUCENE-167 and Solr default handling of Boolean operators is broken
Thanks Yonik,

Should I open a Solr JIRA issue?

Tom

-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Thursday, December 01, 2011 1:16 PM
To: dev@lucene.apache.org
Subject: Re: LUCENE-167 and Solr default handling of Boolean operators is broken

Whew, that was a while ago - I didn't remember even commenting on the issue, but it still makes sense (double-negative aside... boy, I hate re-reading things I wrote too quickly ;-)

The old precedence query parser had issues, IIRC. The precedence query parser based on the flexible queryparser framework in contrib isn't that Solr-friendly (i.e. Solr has a lot of hooks into the current standard query parser, and moving would probably be both error-prone and difficult). SolrCloud is consuming my time right now, but I might be able to take a look to see if this is easy to fix in another month or so (if no one beats me to it). Since it's a major release, we may be able to just fix it in trunk without having to keep the old behavior.

-Yonik
http://www.lucidimagination.com

On Thu, Dec 1, 2011 at 12:51 PM, Burton-West, Tom tburt...@umich.edu wrote: The default query parser in Solr does not handle precedence of Boolean operators in the way most people expect. ...
Solr should provide an option to show only most relevant facet values
Hello all,

This post got no replies after several days on the Solr user list, so I thought I would rewrite it as a question about a possible feature for Solr.

In our use case we have a large number of documents and several facets, such as Author and Subject, that have a very large number of values. Since we index the full text of nearly 10 million books, it is easy for a query to return a very large number of hits.

Here is the problem: if relevance ranking is working well, in theory it doesn't matter how many hits the user gets, as long as the best results show up in the first page of results. When a particular facet has a large number of values, the general practice is to show a relatively small number of facet values, selected as those with the highest counts in the entire result set. However, assuming a very large result set, these facet counts will be dominated by the large number of results that are not relevant to the query.

As an example, if you search our full-text collection for jaguar you get 170,000 hits. If I am looking for the car rather than the OS or the animal, I might expect to be able to click on a facet and limit my results to the car. However, facets containing the word "car" or "automobile" are not in the top 5 facets that we show. If you click on "more" you will see "automobile periodicals" but not the rest of the facets containing the word "automobile". This occurs because the facet counts are computed over all 170,000 hits: the counts for at least 160,000 irrelevant hits are included (assuming only the top 10,000 hits are relevant).

What we would like to do is *select* which facet values to show based on their counts in the *most relevant subset* of documents, but display the actual counts for the full set:

1) get the facet counts for the N most relevant documents (N = 10,000 for example)
2) select the 5 or 30 facet values with the highest counts for those relevant documents
3) display only those 5 or 30 facet values, but display their counts against the entire result set

This is possible to kludge up (subject to some scaling considerations) in the following way, sketched in the code after this post:

1) consider only the 1000 most relevant documents for doing the calculation, so N = 1,000
2) run your query and get the unique document ids for the N most relevant documents (i.e. set rows=N); also get the facet values and counts for the top M facets, where M is some very large number, and store those values and counts in some data structure
3) run a second query which is the same as the first, but add a filter query for those 1000 unique ids; set rows=1 but get facet counts for the top 30 facet values
4) grab the top 5 or 30 facet values from this second query; these are your most relevant facet values
5) use the list of values from the previous step to retrieve the appropriate counts for the whole result set from the earlier stage where you stored the facet counts for the whole result set

It would seem that this could be done much more efficiently inside Solr/Lucene, since instead of getting the unique ids for the N most relevant documents and sending those back to Solr, the code has access to bitsets containing the internal Lucene index ids which get used in the filter queries. Other steps in the process could probably be streamlined as well. Is there already some faceting code work being done along this line? Would it make sense to open a JIRA issue for this?
Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search
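A rough SolrJ sketch of the two-pass kludge outlined above. The "topic" facet field, the "id" unique key, N=1000, and the server URL are illustrative assumptions, not a proposed Solr feature; note also that facet.limit=-1 on pass one can be expensive on a high-cardinality field, and a 1000-term id filter bumps against maxBooleanClauses limits in practice.

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class RelevantFacets {
      public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr =
            new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Pass 1: ids of the 1000 most relevant docs, plus facet counts
        // for the whole result set (kept for display later).
        SolrQuery q1 = new SolrQuery("jaguar");
        q1.setRows(1000);
        q1.setFields("id");
        q1.setFacet(true);
        q1.addFacetField("topic");
        q1.setFacetLimit(-1); // keep all counts so we can look values up below
        QueryResponse r1 = solr.query(q1);

        Map<String, Long> fullCounts = new HashMap<String, Long>();
        for (FacetField.Count c : r1.getFacetField("topic").getValues()) {
          fullCounts.put(c.getName(), c.getCount());
        }

        StringBuilder fq = new StringBuilder("id:(");
        for (SolrDocument d : r1.getResults()) {
          fq.append('"').append(d.getFieldValue("id")).append("\" ");
        }
        fq.append(')');

        // Pass 2: same query restricted to those 1000 docs; the top facet
        // values here are the "most relevant" ones.
        SolrQuery q2 = new SolrQuery("jaguar");
        q2.addFilterQuery(fq.toString());
        q2.setRows(0);
        q2.setFacet(true);
        q2.addFacetField("topic");
        q2.setFacetLimit(30);
        QueryResponse r2 = solr.query(q2);

        // Display the relevant values with their whole-result-set counts.
        for (FacetField.Count c : r2.getFacetField("topic").getValues()) {
          System.out.println(c.getName() + " : " + fullCounts.get(c.getName()));
        }
      }
    }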
RE: [jira] [Resolved] (SOLR-1844) CommonGramsQueryFilterFactory should read words in a comma-delimited format
Hi David,

Just curious about your use of the HathiTrust list. I usually explain to people that it's customized to our index and that they are probably better off making their own list based on the lists of stop words appropriate for the languages in their index (sources listed in the blog post http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance). If you already have an index built and are re-indexing with CommonGrams, you can also use the -t flag with HighFreqTerms.java in Lucene contrib to determine the words that have the largest position lists and are therefore candidates to be added to your CommonGrams word list. We recently ran HighFreqTerms.java against our indexes and discovered that it would be better to remove some of the less frequent foreign-language stopwords and instead use some very frequent words from the index.

Tom Burton-West
www.hathitrust.org/blogs

From: Steven Rowe (JIRA) [j...@apache.org]
Sent: Monday, June 06, 2011 2:08 PM
To: dev@lucene.apache.org
Subject: [jira] [Resolved] (SOLR-1844) CommonGramsQueryFilterFactory should read words in a comma-delimited format

[ https://issues.apache.org/jira/browse/SOLR-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe resolved SOLR-1844.
Resolution: Won't Fix
Assignee: Steven Rowe

Thanks David.

CommonGramsQueryFilterFactory should read words in a comma-delimited format
Key: SOLR-1844
URL: https://issues.apache.org/jira/browse/SOLR-1844
Project: Solr
Issue Type: Improvement
Components: Schema and Analysis
Affects Versions: 1.4
Reporter: David Smiley
Assignee: Steven Rowe
Priority: Minor

CommonGramsQueryFilterFactory expects that the file(s) given to the words argument is a carriage-return-delimited list of words. It doesn't support comments either. This file format should be more flexible and support comma-delimited values. I came across this because I was trying to use the sample file provided by HathiTrust: http://www.hathitrust.org/node/180 (in a file named new400common.txt)
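For reference, running that tool looks something like the following; the jar names and argument order here are assumptions from memory and may differ by version, so check the usage message the class prints:

    java -cp lucene-core.jar:lucene-misc.jar org.apache.lucene.misc.HighFreqTerms /path/to/index -t 100 ocr

where -t sorts by total term frequency rather than document frequency, 100 is the number of terms to report, and 'ocr' is a hypothetical field name.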
RE: MergePolicy Thresholds
Hi Mike and Shai,

I was able to index a few documents with the TieredMergePolicy, but I was hoping to build a large test index of about 700,000 documents to compare the performance against our previous runs, and to report on the results in time for the Lucene Revolution conference. Unfortunately there was a power outage at our data center last week which resulted in a node failure in one of our storage nodes, and node rebalancing for a cluster of 500 terabytes takes quite a while and totally messes up performance measurements. (Our 6-8 terabytes of large-scale search indexes share storage with the repository that holds the 480+ terabytes of page images and metadata for the 8 million+ books.) Hopefully I will be able to run the tests when I get back.

Tom

From: Burton-West, Tom [mailto:tburt...@umich.edu]
Sent: Monday, May 09, 2011 4:10 PM
To: dev@lucene.apache.org
Subject: RE: MergePolicy Thresholds

Thanks again Shai and Mike. I am in the process of downloading and building r108. Should be able to build a test index sometime this week. I'll make some guesses on what parameters to use based on our previous tests.

Tom

From: Shai Erera [mailto:ser...@gmail.com]
Sent: Saturday, May 07, 2011 11:33 PM
To: dev@lucene.apache.org
Subject: Re: MergePolicy Thresholds

Hey Tom, Mike back-ported the changes to 3x, so you can try it out. FYI, Shai

On Tue, May 3, 2011 at 9:33 PM, Burton-West, Tom tburt...@umich.edu wrote: Thanks Shai and Mike! I'll keep an eye on LUCENE-1076. Tom

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Tuesday, May 03, 2011 11:15 AM
To: dev@lucene.apache.org
Subject: Re: MergePolicy Thresholds

Thanks Shai! I'm way behind on my 3.x backports -- I'll try to do this soon.

Mike
http://blog.mikemccandless.com

On Tue, May 3, 2011 at 8:10 AM, Shai Erera ser...@gmail.com wrote: I uploaded a patch to LUCENE-1076. Tom, apparently the patch I've attached before cannot be used, because there are dependencies (in earlier commits on LUCENE-1076) that need to be back-ported as well. So stay tuned on LUCENE-1076 for when it is safe to use this new MP. Shai

On Tue, May 3, 2011 at 1:00 PM, Michael McCandless luc...@mikemccandless.com wrote: That'd be great, thanks :) Yes, let's iterate on the issue! But it should still be open, I hope (I didn't mean to close it yet, since it's not back-ported)... Mike

On Tue, May 3, 2011 at 5:51 AM, Shai Erera ser...@gmail.com wrote: Mike, if you want, I can back-port it, as I've already started this when preparing the patch. I noticed that you added a throws IOE to IW.setInfoStream -- is it OK on 3x too? It'll be a backwards change. Maybe we should iterate on the issue? I can reopen. Shai

On Tue, May 3, 2011 at 12:36 PM, Michael McCandless luc...@mikemccandless.com wrote: Looks good Shai! Comments below too:

On Tue, May 3, 2011 at 5:29 AM, Shai Erera ser...@gmail.com wrote: Hi, I looked into porting it to 3x and prepared the attached patch. It only contains the new TieredMP and test, as well as the necessary changes to LuceneTestCase and IndexWriter. I guess you can start with it (even just the MP and IW changes) to test it on your indexes. Mike, I saw that there were many more changes, as part of LUCENE-1076, done to the code. In particular, this MP is now the default (on trunk), so I guess many changes (to tests) were needed because of that. Do you remember if, apart from the changes I've included in the patch, there were other important changes w.r.t. this code?

The only other changes I can think of were some verbosity improvements to IndexWriter, to support the python script that can make a merge movie from an infoStream output; but that can wait for when I back-port to 3.x...

As we won't change the default MP on 3x, I'm guessing I don't need to port all the changes to 3x.

Right, I think.

Mike
RE: MergePolicy Thresholds
Thanks Shai and Mike! I'll keep an eye on LUCENE-1076.

Tom

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Tuesday, May 03, 2011 11:15 AM
To: dev@lucene.apache.org
Subject: Re: MergePolicy Thresholds

Thanks Shai! I'm way behind on my 3.x backports -- I'll try to do this soon.

Mike
http://blog.mikemccandless.com

On Tue, May 3, 2011 at 8:10 AM, Shai Erera ser...@gmail.com wrote: I uploaded a patch to LUCENE-1076. Tom, apparently the patch I've attached before cannot be used, because there are dependencies (in earlier commits on LUCENE-1076) that need to be back-ported as well. So stay tuned on LUCENE-1076 for when it is safe to use this new MP. Shai ...
RE: MergePolicy Thresholds
Hi Shai and Mike,

Testing the TieredMP on our large indexes has been on my todo list since I read Mike's blog post http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html. If you port it to the 3.x branch, Shai, I'll be more than happy to test it with our very large (300GB+) indexes. Besides being able to set the max merged segment size, I'm especially interested in using the maxSegmentsPerTier parameter. From Mike's blog post: "...maxSegmentsPerTier that lets you set the allowed width (number of segments) of each stair in the staircase. This is nice because it decouples how many segments to merge at a time from how wide the staircase can be."

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Monday, May 02, 2011 2:19 PM
To: dev@lucene.apache.org
Subject: Re: MergePolicy Thresholds

I think it should be an easy port...

Mike
http://blog.mikemccandless.com

On Mon, May 2, 2011 at 2:16 PM, Shai Erera ser...@gmail.com wrote: Thanks Mike. I'll take a look at TieredMP. Does it depend on trunk in any way, or do you think it can easily be ported to 3x? Shai
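Configuring those knobs at the Lucene level looks roughly like the following sketch against the 3.x backport discussed here. The values are illustrative, not recommendations, and the setter for the "maxSegmentsPerTier" parameter mentioned above is named setSegmentsPerTier:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class TieredMPSetup {
      public static void main(String[] args) throws Exception {
        TieredMergePolicy mp = new TieredMergePolicy();
        mp.setMaxMergedSegmentMB(20 * 1024); // cap merged segments at ~20 GB
        mp.setSegmentsPerTier(10.0);         // allowed "width" of each tier
        mp.setMaxMergeAtOnce(10);            // how many segments merge together

        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31,
            new StandardAnalyzer(Version.LUCENE_31));
        iwc.setMergePolicy(mp);
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File(args[0])), iwc);
        // ... index documents ...
        writer.close();
      }
    }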
RE: Link to nightly build test reports on main Lucene site needs updating
Thanks for fixing++

Tom

-Original Message-
From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent: Sunday, May 01, 2011 6:05 AM
To: dev@lucene.apache.org; simon.willna...@gmail.com; java-u...@lucene.apache.org
Subject: RE: Link to nightly build test reports on main Lucene site needs updating

I fixed the nightly docs; once the webserver mirrors them from SVN they should appear. The developer-resources page was completely broken. It now also contains references to the stable 3.x branch, as most users would prefer that one to get the latest bug fixes but don't want a backwards-incompatible version.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
RE: Using contrib Lucene Benchmark with Solr
Thanks Robert and Grant,

Does this need a separate JIRA issue dealing specifically with the ability of Benchmark to read Solr config settings, or is it subsumed in LUCENE-2845? Or should I just add a comment to LUCENE-2845?

Tom

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Wednesday, March 30, 2011 7:56 PM
To: dev@lucene.apache.org
Subject: Re: Using contrib Lucene Benchmark with Solr

On Wed, Mar 30, 2011 at 4:49 PM, Burton-West, Tom tburt...@umich.edu wrote: I would like to be able to use the Lucene Benchmark code with Solr to run some indexing tests. It would be nice if Lucene Benchmark could read Solr configuration rather than having to translate my filter chain and other parameters into Lucene. Would it be appropriate to open a JIRA issue for this, or is this something that doesn't really make any sense?

I think it makes great sense. We moved the benchmarking facility to a top-level module so we can do this: https://issues.apache.org/jira/browse/LUCENE-2845, but we didn't actually add any integration yet. I've been in this exact same situation too when trying to use the benchmark package, and I'd sure like to see better Solr integration with the benchmarking package myself.
Using contrib Lucene Benchmark with Solr
I would like to be able to use the Lucene Benchmark code with Solr to run some indexing tests. It would be nice if Lucene Benchmark could read Solr configuration rather than having to translate my filter chain and other parameters into Lucene. Would it be appropriate to open a JIRA issue for this, or is this something that doesn't really make any sense?

Tom
RE: Is it possible to set the merge policy setMaxMergeMB from Solr
I'm a bit confused. There are some examples in the JIRA issue for SOLR-1447, but I can't tell from reading it what the final allowed syntax is. I see

<!--<mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">-->
<!--<double name="maxMergeMB">64.0</double>-->
<!--</mergePolicy>-->

in the JIRA issue and in what I think is the test case config file: http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/solr/src/test/test-files/solr/conf/solrconfig-propinject.xml?view=log

Lance's example is

<mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy
  <maxMergeMB>1024</maxMergeMB>
</mergePolicy>

Which one is correct?

Tom

-Original Message-
From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com]
Sent: Tuesday, December 07, 2010 10:48 AM
To: dev@lucene.apache.org
Subject: Re: Is it possible to set the merge policy setMaxMergeMB from Solr

SOLR-1447 added this functionality.

On Mon, Dec 6, 2010 at 2:34 PM, Burton-West, Tom tburt...@umich.edu wrote: Lucene has this method to set the maximum size of a segment when merging: LogByteSizeMergePolicy.setMaxMergeMB (http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/index/LogByteSizeMergePolicy.html#setMaxMergeMB%28double%29). I would like to be able to set this in my solrconfig.xml. Is this possible? If not, should I open a JIRA issue, or is there some gotcha I am unaware of? Tom
Is it possible to set the merge policy setMaxMergeMB from Solr
Lucene has this method to set the maximum size of a segment when merging: LogByteSizeMergePolicy.setMaxMergeMB (http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/index/LogByteSizeMergePolicy.html#setMaxMergeMB%28double%29). I would like to be able to set this in my solrconfig.xml. Is this possible? If not, should I open a JIRA issue, or is there some gotcha I am unaware of?

Tom Burton-West
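At the Lucene level, the setting in question is just the following. This is a sketch against the 3.1-era IndexWriterConfig API, which is an assumption here; on 3.0.x the merge policy is attached to the IndexWriter directly, and the 1 GB value is illustrative:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.LogByteSizeMergePolicy;
    import org.apache.lucene.util.Version;

    // ...
    LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
    mp.setMaxMergeMB(1024.0); // don't merge segments larger than ~1 GB
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31,
        new StandardAnalyzer(Version.LUCENE_31));
    iwc.setMergePolicy(mp);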
Solr 1.4.1 Analysis console gives error regarding CharTermAttributeImpl that is not in the target
Hello all,

I am using Solr 1.4.1 and a custom filter that worked with a previous version of Solr that used Lucene 2.9. When I try to use the analysis console I get this error message:

java.lang.IllegalArgumentException: This AttributeSource contains AttributeImpl of type org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl that is not in the target

(See below for a stack trace showing that this is an interaction of the custom punctuation filter and the analysis JSP.) I believe this has to do with this JIRA issue: https://issues.apache.org/jira/browse/LUCENE-2302

I looked at the most recent org.apache.lucene.analysis package document http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/analysis/package.html?view=co but didn't see a mention of CharTermAttributeImpl. Can someone point me to the documentation or example code that might explain the issue?

Tom Burton-West

Stack trace excerpt:

Caused by: java.lang.IllegalArgumentException: This AttributeSource contains AttributeImpl of type org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl that is not in the target
at org.apache.lucene.util.AttributeSource.copyTo(AttributeSource.java:493)
at org.apache.jsp.admin.analysis_jsp$1.incrementToken(org.apache.jsp.admin.analysis_jsp:102)
at org.apache.solr.analysis.PunctuationFilter.incrementToken(PunctuationFilter.java:40)
at org.apache.jsp.admin.analysis_jsp.getTokens(org.apache.jsp.admin.analysis_jsp:131)
at org.apache.jsp.admin.analysis_jsp.doAnalyzer(org.apache.jsp.admin.analysis_jsp:110)
at org.apache.jsp.admin.analysis_jsp._jspService(org.apache.jsp.admin.analysis_jsp:718)
RE: Solr 1.4.1 Analysis console gives error regarding CharTermAttributeImpl that is not in the target
Something here is using Lucene 3.x or trunk code, since CharTermAttribute[Impl] only exists in unreleased versions!

Doh! I forgot to switch my binaries back to Solr 1.4.1 from 3.x. Thanks for the catch, Robert. The subject line should read: "Solr/Lucene 3.x Analysis console gives error regarding CharTermAttributeImpl that is not in the target".

I do need to port my filter to Lucene 3.x, so is there 3.x documentation about the use of CharTermAttributeImpl? Is this something that needs to be in the TokenStream examples in the 3.0.2 org.apache.lucene.analysis package.html?

Tom
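For the archives, porting a filter is mostly mechanical. A minimal 3.x-style filter using CharTermAttribute looks like this sketch; the upper-casing body is just a stand-in for real filter logic:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class MyFilter extends TokenFilter {
      // replaces the 2.9-era TermAttribute; backed by a char[] editable in place
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

      public MyFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
          return false;
        }
        final char[] buffer = termAtt.buffer();
        final int length = termAtt.length();
        for (int i = 0; i < length; i++) {
          buffer[i] = Character.toUpperCase(buffer[i]);
        }
        return true;
      }
    }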
RE: Solr 1.4.1 Analysis console gives error regarding CharTermAttributeImpl that is not in the target
OK, I was using a recent unreleased version of Solr/Lucene but looking at the Lucene 3.0.2 docs instead of the nightly build docs. Found the answer I needed in the nightly build docs: https://hudson.apache.org/hudson/job/Lucene-trunk/javadoc//core/org/apache/lucene/analysis/package-summary.html

Tom

-Original Message-
From: Burton-West, Tom [mailto:tburt...@umich.edu]
Sent: Thursday, November 11, 2010 1:26 PM
To: dev@lucene.apache.org
Subject: RE: Solr 1.4.1 Analysis console gives error regarding CharTermAttributeImpl that is not in the target

Doh! I forgot to switch my binaries back to Solr 1.4.1 from 3.x. Thanks for the catch, Robert. ...
RE: Antw.: Solr 1.4.1 Analysis console gives error regarding CharTermAttributeImpl that is not in the target
Thanks Uwe,

A bug in analysis.jsp is consistent with what I am seeing. I can run explain/debug queries using my filter in the Solr/Lucene 3.x version and it's clearly working. However, I get the error when I try the analysis console. Is this the same issue as SOLR-2051?

Tom

From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent: Thursday, November 11, 2010 1:49 PM
To: Burton-West, Tom; dev@lucene.apache.org
Subject: Antw.: Solr 1.4.1 Analysis console gives error regarding CharTermAttributeImpl that is not in the target

I still think this is a bug in analysis.jsp. copyTo does not work here correctly because it tries to copy a TA to a CTA. It seems that analysis.jsp generates the target AttributeSource incorrectly. I will look into this.

---
Uwe Schindler
Generics Policeman
Bremen, Germany

- Reply message -
From: Burton-West, Tom tburt...@umich.edu
Date: Thu., Nov. 11, 2010 19:03
Subject: Solr 1.4.1 Analysis console gives error regarding CharTermAttributeImpl that is not in the target
To: dev@lucene.apache.org

Hello all, I am using Solr 1.4.1 and a custom filter that worked with a previous version of Solr that used Lucene 2.9. When I try to use the analysis console I get this error message: java.lang.IllegalArgumentException: This AttributeSource contains AttributeImpl of type org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl that is not in the target ...
RE: Antw.: Solr 1.4.1 Analysis console gives error regarding CharTermAttributeImpl that is not in the target
Sorry about the confusion (my confusion, mostly :). I was actually using revision 1030032 of Lucene/Solr (see below) with a custom token filter that does not use CharTermAttribute. I'll recompile the custom filter against this revision and verify that analysis.jsp produces the same results in a few minutes.

Solr Specification Version: 3.0.0.2010.11.03.16.59.02
Solr Implementation Version: 3.1-SNAPSHOT 1030032 - tburtonw - 2010-11-03 16:59:02
Lucene Specification Version: 3.1-SNAPSHOT
Lucene Implementation Version: 3.1-SNAPSHOT 1030032 - 2010-11-03 17:00:44

Tom

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Thursday, November 11, 2010 1:54 PM
To: dev@lucene.apache.org
Subject: Re: Antw.: Solr 1.4.1 Analysis console gives error regarding CharTermAttributeImpl that is not in the target

On Thu, Nov 11, 2010 at 1:49 PM, Uwe Schindler u...@thetaphi.de wrote: I still think this is a bug in analysis.jsp. copyTo does not work here correctly because it tries to copy a TA to a CTA. It seems that analysis.jsp generates the target AttributeSource incorrectly. I will look into this.

I think (perhaps I am mistaken) that Tom somehow mixed up some newer binaries with Solr 1.4.1/Lucene 2.9. Tom, am I mistaken? Your message says you are using Solr 1.4.1, that's what's confusing me. Did you actually receive this error on branch_3x Solr's analysis.jsp with an old TermAttribute-using TokenFilter?
RE: Antw.: Solr 1.4.1 Analysis console gives error regarding CharTermAttributeImpl that is not in the target
Thanks Uwe,

I recompiled my filter against revision 1030032 of Lucene/Solr and confirmed the same behavior (the error message about "CharTermAttributeImpl that is not in the target"). Then I applied your patch and recompiled Lucene/Solr. Your patch fixes the problem: analysis.jsp now works fine with my filter. Opened issue SOLR-2234.

Tom

-Original Message-
From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent: Thursday, November 11, 2010 2:49 PM
To: dev@lucene.apache.org
Subject: RE: Antw.: Solr 1.4.1 Analysis console gives error regarding CharTermAttributeImpl that is not in the target

Hi Tom,

Can you try the attached LuSolr patch? This is a problem of the backwards layer for CTA/TA coexistence. This is a hack, but it ensures that both attributes always use the same implementation class. If this fixes your bug, can you open an issue for 3.x and I will commit?

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Thursday, November 11, 2010 8:20 PM
To: dev@lucene.apache.org
Subject: Re: Antw.: Solr 1.4.1 Analysis console gives error regarding CharTermAttributeImpl that is not in the target

On Thu, Nov 11, 2010 at 2:05 PM, Burton-West, Tom tburt...@umich.edu wrote: Sorry about the confusion (my confusion, mostly :). I was actually using revision 1030032 of Lucene/Solr (see below) with a custom token filter that does not use CharTermAttribute. ...

Thanks Tom, this sounds like a good catch then. From your previous reply, I do think some of the issues discussed in SOLR-2051 could be related. As I mentioned there, this analysis.jsp is not well-behaved: it crosses the tokenstreams, and really I think Uwe's comment at http://s.apache.org/n5 describes the proper solution, where it then is a well-behaved, more accurate representation of what is going on with analysis. I think this is why you probably don't have any other problems with your filter, except in this analysis.jsp. But it would still be good to check that it's not a general bug in AttributeSource.copyTo, because if so, someone will hit this problem with SynonymFilter combined with an old TokenStream.
RE: Flex indexing: Hybrid index maintenance for faster indexing
Thanks Mike,

I suspected the approach might require architectural changes beyond flex, but since our indexes are so huge and disk I/O is our main bottleneck both for searching and indexing, I'm always looking for ways to deal with very large postings and positions lists that might reduce I/O.

I haven't looked in detail into PFOR and Simple9 and some of the other new encodings, but my understanding is that they trade off compression for decompression speed, i.e. they take up a bit more space but are more efficient to decompress. In our case, where we have underutilized CPU, mostly because the processors are waiting on disk I/O, I'll be curious to find out whether the slight increase in disk I/O time due to lower compression is still outweighed by the increase in decompression speed. (Don't know if we'll find the time to try flex for a while though :)

BTW, have you seen this paper looking at 64-bit words? Index Compression Using 64-Bit Words, Anh & Moffat. Software: Practice and Experience, 40(2):131-148, February 2010.

Tom

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Tuesday, October 05, 2010 6:21 AM
To: dev@lucene.apache.org
Subject: Re: Flex indexing: Hybrid index maintenance for faster indexing

Nice paper! It's a neat trick to index the large postings as separate files, i.e. let the filesystem handle the growth as new postings are appended over time. But unfortunately we can't easily do this in Lucene, since Lucene assumes index files are write-once and derives its transactional semantics from this approach. That is, this would require sizable changes, beyond just swapping in a different Codec.

Still, the idea that small and big postings lists should be handled differently is something we can take advantage of in a Codec, and I think we should. I think likely we will switch to a default codec that uses pulsing (storing a term's postings directly in the terms dict) for very low-freq terms, maybe vInt for medium-freq terms, and FOR/PFOR for high-freq terms.

Mike

On Mon, Oct 4, 2010 at 6:42 PM, Burton-West, Tom tburt...@umich.edu wrote: Hi all, Would it be possible to implement something like this in Flex? Büttcher, S., & Clarke, C. L. A. (2008). Hybrid index maintenance for contiguous inverted lists. Information Retrieval, 11(3), 175-207. doi:10.1007/s10791-007-9042-8. The approach takes advantage of having a different policy for large postings lists (i.e. frequent terms) versus small postings lists for flushing the buffer and writing to disk. Tom Burton-West
Flex indexing: Hybrid index maintenance for faster indexing
Hi all,

Would it be possible to implement something like this in Flex?

Büttcher, S., & Clarke, C. L. A. (2008). Hybrid index maintenance for contiguous inverted lists. Information Retrieval, 11(3), 175-207. doi:10.1007/s10791-007-9042-8

The approach takes advantage of having a different policy for large postings lists (i.e. frequent terms) versus small postings lists for flushing the buffer and writing to disk.

Tom Burton-West
Merge policy to merge during off-peak hours
Hello all,

Lucene in Action, 2nd Edition mentions a time-dependent merge policy that defers large merges until off-peak hours (Section 2.13.6, p. 71). Has anyone implemented such a policy? Is it worth opening a JIRA issue for this?

Tom Burton-West
www.hathitrust.org/blogs
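One cheap approximation of such a policy, an untested sketch under assumed hours and sizes rather than the LIA2 implementation, is to shrink LogByteSizeMergePolicy's maxMergeMB during peak hours so that only small merges run, and raise it off-peak:

    import java.util.Calendar;
    import org.apache.lucene.index.LogByteSizeMergePolicy;

    public class OffPeakMergeTuner {
      private final LogByteSizeMergePolicy mp;

      public OffPeakMergeTuner(LogByteSizeMergePolicy mp) {
        this.mp = mp;
      }

      /** Call periodically, e.g. from a timer thread. Hours are illustrative. */
      public void adjust() {
        int hour = Calendar.getInstance().get(Calendar.HOUR_OF_DAY);
        boolean offPeak = hour >= 22 || hour < 6;
        // allow large merges (up to ~100 GB) only between 10pm and 6am
        mp.setMaxMergeMB(offPeak ? 100 * 1024 : 512);
      }
    }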
RE: Benchmarking Solr indexing using Lucene Benchmark?
Thanks Jason,

I'll take a look at how much work is involved, and if getting it to work with the Solr config looks reasonably doable (in the time I have available), I'll give it a try and report back. Do you think it's worth opening a JIRA issue?

Tom

-Original Message-
From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com]
Sent: Monday, June 14, 2010 12:02 PM
To: dev@lucene.apache.org
Subject: Re: Benchmarking Solr indexing using Lucene Benchmark?

Tom, this was discussed a while back; however, I don't believe anything was committed. I think there's a fair bit of work involved, in that the Lucene benchmark config would not be usable; rather, it would need to simply point to a Solr solrconfig.xml file. Other than that, the resulting statistical reporting should be useful.

Jason

On Mon, Jun 14, 2010 at 8:57 AM, Burton-West, Tom tburt...@umich.edu wrote: Hi all, Posted this to the Solr users list, and after a week with no responses thought I would try the dev list. ...
Benchmarking Solr indexing using Lucene Benchmark?
Hi all,

I posted this to the Solr users list, and after a week with no responses thought I would try the dev list. We are about to test various factors to try to speed up our indexing process. One set of experiments will try various maxRamBufferSizeMB settings. Since the factors we will be varying are at the Lucene level, we are considering using the Lucene Benchmark utilities in Lucene/contrib. Have other Solr users used Lucene Benchmark? Can anyone provide any hints for adapting it to Solr? (Are there any common gotchas, etc.?)

Tom

Tom Burton-West
University of Michigan Libraries
http://www.hathitrust.org/blogs/large-scale-search
questions about DocsEnum.read() in the flex API
I'm a bit confused about DocsEnum.read() in the flex API. I have three questions:

1) DocsEnum.read() currently delegates to nextDoc() in the base class, and there is a note that subclasses may do this more efficiently. Is there currently a more efficient implementation in a subclass? I didn't see one in MultiDocsEnum or MappingMultiDocsEnum, but perhaps I'm not understanding the code.

2) DocsEnum.read() reads 64 docs/freqs at a time, as set up in initBulkResult(). Would it make sense to have this configurable as an argument somewhere? I'm looking at very large indexes where a common term might occur in 100,000 or more docs.

3) At the very top of the Javadoc there is a warning: "you must first call nextDoc". It seems that this applies to calling DocsEnum.docID() or DocsEnum.freq() but not to DocsEnum.read(). Is that correct?

Tom Burton-West
RE: questions about DocsEnum.read() in the flex API
Thanks Mike! A follow-up question:

DocsEnum.read() currently delegates to nextDoc() in the base class, and there is a note that subclasses may do this more efficiently. Is there currently a more efficient implementation in a subclass?

Yes, the standard codec does so (StandardPostingsReaderImpl.java).

I assume that the standard codec is the default. Will what I'm using in HighFreqTermsWithTF to instantiate an IndexReader (below) eventually end up instantiating the StandardPostingsReaderImpl, or do I need to do something explicit that will cause it to be instantiated?

dir = FSDirectory.open(new File(args[0]));
reader = IndexReader.open(dir, true);

Tom
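Regarding question 3 of the original post, the per-document iteration pattern against the flex API looks roughly like the sketch below. These names were in flux on trunk at the time and changed later, so treat the exact signatures as assumptions; the point is that the enum is positioned before the first document, so docID()/freq() are only valid after nextDoc() has returned a real doc.

    import java.io.IOException;
    import org.apache.lucene.index.DocsEnum;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BytesRef;

    public class PostingsWalk {
      static void iterate(IndexReader reader) throws IOException {
        // "ocr" and "jaguar" are hypothetical field/term names
        DocsEnum de = MultiFields.getTermDocsEnum(reader,
            MultiFields.getDeletedDocs(reader), "ocr", new BytesRef("jaguar"));
        if (de == null) {
          return; // term does not exist in this field
        }
        int doc;
        while ((doc = de.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
          int freq = de.freq(); // valid only after nextDoc() has positioned us
          System.out.println(doc + " freq=" + freq);
        }
      }
    }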
RE: Fix to contrib/misc/HighFreqTerms.java
Hi Mike,

Thanks for making the fix and changing the display from bytes to UTF-8. It needs a very minor change: the latest fix converts to UTF-8 if you give a field argument on the command line, but still shows bytes if you don't. Line 89 should parallel line 70 and use term.utf8ToString() instead of term.toString():

70: tiq.insertWithOverflow(new TermInfo(new Term(field, term.utf8ToString()), termsEnum.docFreq()));
89: tiq.insertWithOverflow(new TermInfo(new Term(field, term.toString()), terms.docFreq()));

Tom

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Wednesday, April 14, 2010 3:50 PM
To: java-dev@lucene.apache.org
Subject: Re: Bug in contrib/misc/HighFreqTerms.java?

OK, I committed the fix. I ran it on a flex Wikipedia index I had... it produces output like this:

body:[3c 21 2d 2d] 509050
body:[73 68 6f 75 6c 64] 515495
body:[74 68 65 6e] 525176
body:[74 69 74 6c 65] 525361
body:[5b 5b 55 6e 69 74 65 64] 532586
body:[6b 6e 6f 77 6e] 533558
body:[75 6e 64 65 72] 536480
body:[55 6e 69 74 65 64] 543746

which is not very readable, but it does this because flex terms are arbitrary byte[], not necessarily UTF-8... maybe we should fix it to print both hex and String if we assume the bytes are UTF-8?

Mike

On Wed, Apr 14, 2010 at 3:25 PM, Michael McCandless luc...@mikemccandless.com wrote: Ugh, I'll fix this. With the new flex API, you can't ask a composite (Multi/DirReader) for its postings -- you have to go through the static methods on MultiFields. I'm trying to put some distance b/w IndexReader and composite readers... because I'd like to eventually deprecate them. I.e., the composite readers should hold an ordered collection of sub-readers, but should not themselves implement IndexReader's API, I think. Thanks for raising this Tom, Mike

On Wed, Apr 14, 2010 at 2:14 PM, Burton-West, Tom tburt...@umich.edu wrote: When I try to run HighFreqTerms.java in Lucene revision 933722, I get the exception appended below. ...
Bug in contrib/misc/HighFreqTerms.java?
When I try to run HighFreqTerms.java in Lucene revision 933722, I get the exception appended below. I believe the line of code involved is a result of the flex indexing merge. Should I post this as a comment to LUCENE-2370 (Reintegrate flex branch into trunk)? Or is there simply something wrong with my configuration?

Exception in thread "main" java.lang.UnsupportedOperationException: please use MultiFields.getFields if you really need a top level Fields (NOTE that it's usually better to work per segment instead)
at org.apache.lucene.index.DirectoryReader.fields(DirectoryReader.java:762)
at org.apache.lucene.misc.HighFreqTerms.main(HighFreqTerms.java:71)

Tom Burton-West
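For reference, the pattern the exception is pointing at looks roughly like the following sketch against the flex trunk of this era; exact signatures shifted during development, so treat them as assumptions. The utf8ToString() display is the same conversion discussed in the fix above.

    import org.apache.lucene.index.Fields;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    public class TermDump {
      static void dumpTerms(IndexReader reader, String field) throws Exception {
        Fields fields = MultiFields.getFields(reader); // composite-reader-safe view
        Terms terms = fields.terms(field);
        if (terms == null) {
          return; // field has no indexed terms
        }
        TermsEnum te = terms.iterator();
        BytesRef term;
        while ((term = te.next()) != null) {
          // flex terms are arbitrary byte[]; assume UTF-8 for display
          System.out.println(term.utf8ToString() + " docFreq=" + te.docFreq());
        }
      }
    }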
Solr BufferedTokenStream and new Lucene 2.9 TokenStream API
Hello all,

Would it be appropriate to open a JIRA issue to get converting the Solr BufferedTokenStream class to the new Lucene 2.9 TokenStream API onto the todo list? Alternatively, is there a more general issue already open regarding Solr filters and the new API? (I couldn't find one.) Or is it better to wait until the Lucene 2.9 API becomes final (https://issues.apache.org/jira/browse/LUCENE-1693) before opening a JIRA issue?

Tom Burton-West
How to contribute question (patch against release or latest trunk?)
Hello, I read the How to Contribute page on the wiki and want to make a patch. Do I make the patch against the latest Solr trunk or against the last release? Tom
Tests fail for solrj.embedded on Windows (revisions 786676 and 775664)
Hello all,

About every other time I check out a current version of trunk and run the tests, the tests for solrj.embedded.* fail. I'm running under Windows XP with Java version 1.6.0_13, Java(TM) SE Runtime Environment (build 1.6.0_13-b03).

With the latest revision, 786676, I get these two failure messages:

[junit] Running org.apache.solr.client.solrj.embedded.MergeIndexesEmbeddedTest
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 1.188 sec
[junit] Test org.apache.solr.client.solrj.embedded.MergeIndexesEmbeddedTest FAILED
[junit] Running org.apache.solr.client.solrj.embedded.MultiCoreEmbeddedTest
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 1.187 sec
[junit] Test org.apache.solr.client.solrj.embedded.MultiCoreEmbeddedTest FAILED

I previously had failures with revision 775664 for org.apache.solr.client.solrj.embedded.SolrExampleStreamingTest with a slightly earlier version of the JDK: http://issues.apache.org/jira/browse/SOLR-1014?focusedCommentId=12710502#action_12710502

Is there some magic setting, environment variable, or JUnit version that I am missing? What is the recommended workaround?

Tom Burton-West
How to Contribute question
Hello,

I read the How to Contribute document on the wiki (http://wiki.apache.org/solr/HowToContribute#head-385f123f540367646df16825ca043d0098b31365). I have written a custom analyzer (https://issues.apache.org/jira/browse/SOLR-908) and would like to create a patch as documented in the wiki. My question is where I should put my files in the source tree to generate the patch. Should they go in trunk/contrib/mycode, or in src/java/org/apache/solr/analysis and src/test/org/apache/solr/analysis?

Tom Burton-West
tburt...@umich.edu