RE: Shingle filter that reads the script attribute from ICUTokenizer and LUCENE-2906

2011-12-17 Thread Burton-West, Tom
Thanks Robert,



Another idea apart from your solution would be to add a tailoring for
tibetan that sets some special attribute indicating 'word-final
syllable'. Then this information is not 'lost' and downstream can do
the right thing.

...So essentially before doing anything like that, it would be
best to know 'the rules of the game' before thinking about any design.

So the ICUTokenizer would have to add that word-final syllable attribute based 
on some rules, and then a downstream filter could use the attribute to construct 
bigrams without creating stupid bigrams.
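
Just to make the idea concrete, the attribute Robert describes might look roughly 
like the sketch below (hypothetical names; the matching AttributeImpl and the 
tokenizer-side tailoring that would actually set it are not shown):

import org.apache.lucene.util.Attribute;

/**
 * Hypothetical attribute a tailored ICUTokenizer could set when a Tibetan
 * syllable is known to be word-final (e.g. it is followed by shad, or it is a
 * syllable that only occurs in word-final position). A downstream bigram
 * filter would call addAttribute(WordFinalSyllableAttribute.class) and avoid
 * shingling across any token where isWordFinal() returns true.
 */
public interface WordFinalSyllableAttribute extends Attribute {
  void setWordFinal(boolean wordFinal);
  boolean isWordFinal();
}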

If we end up doing the project, we will be working with people who have 
expertise in Tibetan, and hopefully they will be able to tell us the rules of 
the game.

Tom

___


Another idea apart from your solution would be to add a tailoring for
tibetan that sets some special attribute indicating 'word-final
syllable'. Then this information is not 'lost' and downstream can do
the right thing.
It's not a difficult thing to do for the tokenizer, but we would need
more details: a quick glance at some stuff on Tibetan punctuation
indicates it's not 'this simple': for some syllables the
punctuation is sometimes omitted. Honestly I don't know why this is; maybe it
means there are some syllables that only appear in word-final
position? If so, such important clues should also trigger this
attribute. So essentially, before doing anything like that, it would be
best to know 'the rules of the game' before thinking about any design.





Shingle filter that reads the script attribute from ICUTokenizer and LUCENE-2906

2011-12-16 Thread Burton-West, Tom
The ICUTokenizer now adds a script attribute for tokens (as do StandardTokenizer 
and a couple of others; see LUCENE-2911), for example "Tibetan" or "Han". If the 
Shingle filter had some provision to only make token n-grams when the script 
attribute matched some specified script, it would solve both the need to produce 
character bigrams for CJK (Han) and syllable bigrams for Tibetan.  We already 
opened an issue to create overlapping bigrams for CJK (LUCENE-2906).
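
To sketch what I have in mind (very rough: it assumes the ICUTokenizer is 
upstream so the ScriptAttribute is populated, it ignores what actually separated 
the syllables, and it doesn't set positions/offsets or keep the unigrams):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.icu.tokenattributes.ScriptAttribute;
import com.ibm.icu.lang.UScript;

/**
 * Sketch only: turns each Tibetan syllable (except the first in a run) into an
 * overlapping bigram of it and the preceding syllable; every other script is
 * passed through untouched.
 */
public final class ScriptBigramFilterSketch extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final ScriptAttribute scriptAtt = addAttribute(ScriptAttribute.class);
  private String prevTerm = null;
  private int prevScript = UScript.COMMON;

  public ScriptBigramFilterSketch(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    String term = termAtt.toString();
    int script = scriptAtt.getCode();
    if (script == UScript.TIBETAN && prevScript == UScript.TIBETAN) {
      // this syllable and the previous one are both Tibetan: emit the bigram
      termAtt.setEmpty().append(prevTerm).append(term);
    }
    prevTerm = term;
    prevScript = script;
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    prevTerm = null;
    prevScript = UScript.COMMON;
  }
}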

Would it make sense to open an issue for modifying the Shingle filter to have 
configurable script-specific behavior, or is this just another use case for 
LUCENE-2906?

If it is another use case for LUCENE-2906, then perhaps we need to change the 
summary of the issue to generalize it beyond CJK.

Any suggestions?

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search



RE: Shingle filter that reads the script attribute from ICUTokenizer and LUCENE-2906

2011-12-16 Thread Burton-West, Tom
Hi Robert,

Thanks for the quick and thoughtful response. 

I didn't realize these complexities and thought maybe there was an easy 
solution :)

We may be involved in a project that involves Tibetan text, and given our 
current resources and priorities, we would stick it in the same field as the 
other 400+ languages.  I was hoping that with the script attribute output by 
the ICUTokenizer, we could figure out a way to do script/language-specific 
processing for Tibetan without adversely affecting anything else. 

"I suppose to inhibit stupid bigrams you would *not* shingle across shad as 
well..."

Unfortunately, it sounds like the ICUTokenizer will segment on the Tibetan 
phrase separators but downstream filters won't know that, so we couldn't have a 
downstream filter that avoided bigramming across a phrase separator. On the 
other hand, it might be that stupid overlapping bigrams don't hurt retrieval 
compared to treating syllables as if they were words, i.e. syllable unigrams. 
(I've not been able to find much published research in English on the issue, and 
many of the references are to articles in Chinese-language publications.  I'm 
pretty much relying on the article by Hackett and Oard.)


Tom


Hackett, P. G., &amp; Oard, D. W. (2000). Comparison of word-based and 
syllable-based retrieval for Tibetan (poster session). In Proceedings of the 
Fifth International Workshop on Information Retrieval with Asian Languages 
(IRAL '00), pp. 197-198. Hong Kong, China. doi:10.1145/355214.355242

http://dl.acm.org/citation.cfm?doid=355214.355242

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Friday, December 16, 2011 6:45 PM
To: dev@lucene.apache.org
Subject: Re: Shingle filter that reads the script attribute from ICUTokenizer 
and LUCENE-2906

On Fri, Dec 16, 2011 at 5:44 PM, Burton-West, Tom tburt...@umich.edu wrote:
 The ICUTokenizer now adds a script attribute for tokens (as do Standard
 Tokenizer and a couple of others (LUCENE-2911)  For example “Tibetan” or
 “Han”.   If the Shingle filter had some provision to only make token n-grams
 when the script attribute matched some specified script, it would solve both
 the need to produce character bigrams for CJK ( Han) and syllable bigrams
 for Tibetan.  We already opened an issue to create overlapping bigrams for
 CJK (LUCENE-2906) .

Not sure it totally would, because there are some key differences
and a few complications:
1. CJKTokenizer today creates bigrams in runs of CJK text, where a run is
something like [IHK]+ (ideographic, hiragana, katakana). There
are different variations on this available too, like only bigramming I+
and doing something else with the katakana (like keeping it as a word). Seems
like the verdict from previous studies is that there are options there
and they tend to both work well. But one thing is still for sure: I
think it would be bad here to form bigrams across what was not contiguous
text (e.g. across sentence boundaries). Finally, some CJK
normalization (such as halfwidth/fullwidth conversion) is not a 1:1
replacement, so the process here should at least be aware of
this and consider some sequences of half-width kana as a single
'character'.
2. Unlike the CJK case, where you bigram a run, Tibetan separates
syllables with special punctuation (tsheg among other things); that is
the reason you get syllables as output from these tokenizers.
So this is already a fundamentally different bigram
algorithm, because it's no longer contiguous runs: the syllables
often have something in between, and what that something
is tells you whether it's e.g. a syllable separator or something more like a
phrase separator. I suppose to inhibit stupid bigrams you would *not*
shingle across shad as well... how to generalize that? The verdict for
this language definitely isn't in yet; I've only seen some very
initial rough work on this language and we aren't totally sure this
works well on average.
3. Other complex languages besides these are also emitting syllables
at best, too: Thai, Lao, Myanmar, Khmer? Shouldn't we bigram those too?
Except, one implementation (ICUTokenizer) is emitting syllables here
(what type of syllable depends upon the current implementation, too!),
and the other (StandardTokenizer) is emitting whole phrases as words.
It would be great to bigram the former (we think!), but even more
horrible to do it to the latter. I put "we think" here because there
has really been no work done here, so it's just intuition/guessing.
And to make matters worse, we have a filter in contrib
(ThaiWordFilter) that relies upon the specifics of how
StandardTokenizer screws up Thai tokenization so it can 'retokenize'.


 Would it make sense to open an issue for modifying the Shingle filter to
 have configurable script-specific behavior, or is this just another use case
 for LUCENE 2906?

 If it is another use case for LUCENE 2906, then perhaps we

re: LUCENE-167 and Solr default handling of Boolean operators is broken

2011-12-01 Thread Burton-West, Tom
The default query parser in Solr does not handle precedence of Boolean 
operators in the way most people expect.

"A AND B OR C" gets interpreted as "A AND (B OR C)". There are numerous other 
examples in the JIRA ticket for LUCENE-167, in this article on the wiki 
http://wiki.apache.org/lucene-java/BooleanQuerySyntax and in this blog post: 
http://robotlibrarian.billdueber.com/solr-and-boolean-operators/

This issue was reported in 2003, but the fix does not seem to have made it into 
the default query parser for either Lucene or Solr.

It appears that LUCENE-167 was closed in 2009 based on the assumption that the 
query parser in LUCENE-1823 would become the default Lucene query parser.  
However, LUCENE-1823 seems to have gotten bogged down and is not yet resolved.  
I do see that there is a precedence query parser in LUCENE-1937, which was 
committed to contrib in the 3x 
branch: (http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/queryparser/src/java/org/apache/lucene/queryParser/precedence/package.html?view=co)
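
For reference, calling that contrib parser directly would presumably look 
something like the sketch below (based on the flexible-queryparser API; the 
constructor and parse() signature are from memory, so treat the details as 
unverified):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.precedence.PrecedenceQueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class PrecedenceParserDemo {
  public static void main(String[] args) throws Exception {
    // unlike the default QueryParser, this parser should give AND higher
    // precedence than OR
    PrecedenceQueryParser parser =
        new PrecedenceQueryParser(new StandardAnalyzer(Version.LUCENE_34));
    Query q = parser.parse("A AND B OR C", "text");
    System.out.println(q);   // expected something like: (+A +B) C
  }
}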

Would it be possible to use the contrib 3x  precedence query parser in Solr?
Would this require modifying the LuceneQParserPlugin and if so would it make 
sense to open a JIRA issue?

Are there any plans to make the precedence query parser the default for either 
Lucene or Solr?

If not, are there any plans to make it more prominent in the documentation that 
the default Lucene query parser has issues with precedence?


A bit more background below

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search


More Background

There were some concerns about breaking backward compatibility, but in a mailing 
list post in 2005 Yonik Seeley said:
"The current behavior is so surprising that I doubt that no one is
relying on it."
(http://www.mail-archive.com/java-user@lucene.apache.org/msg00018.html)

and Doug Cutting said: "+1. Fixing operator precedence seems to me like an 
acceptable incompatibility. The change needs to be well documented in release 
notes, and the old QueryParser should be available, deprecated, for a time for 
back-compatibility."
(http://www.mail-archive.com/java-user@lucene.apache.org/msg00037.html)





RE: LUCENE-167 and Solr default handling of Boolean operators is broken

2011-12-01 Thread Burton-West, Tom
Thanks Yonik,

Should I open a Solr JIRA issue?

Tom

-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Thursday, December 01, 2011 1:16 PM
To: dev@lucene.apache.org
Subject: Re: LUCENE-167 and Solr default handling of Boolean operators is broken

Whew, that was a while ago - didn't remember even commenting on the
issue, but it still makes sense (double-negative aside... boy I hate
re-reading things I wrote too quickly ;-)

The old precedence query parser had issues IIRC.  The precedence query
parser based on the flexible queryparser framework in contrib isn't
that Solr friendly (i.e. Solr has a lot of hooks into the current
standard query parser and moving would probably be both error prone
and difficult).

SolrCloud is consuming my time right now, but I might be able to take a
look to see if this is easy to fix in another month or so (if no one
beats me to it).  Since it's a major release, we may be able to just
fix it in trunk w/o having to keep the old behavior.

-Yonik
http://www.lucidimagination.com



On Thu, Dec 1, 2011 at 12:51 PM, Burton-West, Tom tburt...@umich.edu wrote:
 The default query parser in Solr does not handle precedence of Boolean
 operators in the way most people expect.

 A AND B OR C gets interpreted as A AND (B OR C) . There are numerous
 other examples in the JIRA ticket for Lucene 167, this article on the wiki
 http://wiki.apache.org/lucene-java/BooleanQuerySyntax and in this blog post:
 http://robotlibrarian.billdueber.com/solr-and-boolean-operators/

 This issue was reported in 2003 but the fix does not seem to have made it
 into the default query parser for either Lucene or Solr

 It appears that Lucene 167 was closed in 2009 based on the assumption that
 the query parser in Lucene 1823 would become the default Lucene query
 parser.  However 1823 seems to have gotten bogged down and is not yet
 resolved.  I do see that there is a precedence query parser in LUCENE-1937
 which was committed to contrib. in  the 3x
 branch:(http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/queryparser/src/java/org/apache/lucene/queryParser/precedence/package.html?view=co)

 Would it be possible to use the contrib 3x precedence query parser in Solr?
 Would this require modifying the LuceneQParserPlugin and if so would it make
 sense to open a JIRA issue?

 Are there any plans to make the precedence query parser the default for
 either Lucene or Solr?

 If not, are there any plans to make it more prominent in the documentation
 that the default Lucene query parser has issues with precedence?


 A bit more background below

 Tom Burton-West
 http://www.hathitrust.org/blogs/large-scale-search
 

 More Background

 There were some concerns about breaking backward compatibility but in a
 mailing list post in 2005  Yonik Sealy said:
 The current behavior is so surprising that I doubt  that no one is
 relying on it.
 (http://www.mail-archive.com/java-user@lucene.apache.org/msg00018.html)

 and Doug Cutting said  +1. Fixing operator precedence seems to me like an
 acceptable incompatibility. The change needs to be well documented in
 release notes, and the old QueryParser should be available, deprecated, for
 a time for back-compatibility.
 (http://www.mail-archive.com/java-user@lucene.apache.org/msg00037.html)







Solr should provide an option to show only most relevant facet values

2011-09-27 Thread Burton-West, Tom
Hello all,

This post is getting no replies after several days on the Solr user list, so I 
thought I would rewrite it as a question about a possible feature for Solr.

In our use case we have a large number of documents and several facets, such as 
Author and Subject, that have a very large number of values.  Since we index 
the full text of nearly 10 million books, it is easy for a query to return a 
very large number of hits.

Here is the problem:

If relevance ranking is working well, in theory it doesn't matter how many hits 
the user gets as long as the best results show up in the first page of results. 
When a particular facet has a large number of values, the general practice is 
to show a relatively small number of facet values, selected as those values 
with the highest counts in the entire result set.  However, assuming a very 
large result set, these facet counts will be affected by the large number of 
results that are not relevant to the query.

As an example, if you search in our full-text collection for "jaguar" you get 
170,000 hits.  If I am looking for the car rather than the OS or the animal, I 
might expect to be able to click on a facet and limit my results to the car.  
However, facets containing the word "car" or "automobile" are not in the top 5 
facets that we show.  If you click on "more" you will see "automobile 
periodicals" but not the rest of the facets containing the word "automobile".  
This occurs because the facet counts are for all 170,000 hits.  The facet 
counts for at least 160,000 irrelevant hits are included (assuming only the 
top 10,000 hits are relevant).

What we would like to do is *select* which facet values to show, based on their 
counts in the *most relevant subset* of documents, but display the actual 
counts for the full set:

1)  Get the facet counts for the N most relevant documents (N = 10,000 for 
example).
2)  Select the 5 or 30 facet values with the highest counts for those 
relevant documents.
3)  Display only those 5 or 30 facet values, but display the counts for those 
values against the entire result set.

This is possible to kludge up (subject to some scaling considerations) in the 
following way (a rough SolrJ sketch follows the steps):

1)  Consider only the 1000 most relevant documents for doing the 
calculation, so N = 1,000.
2)  Do your query and get the unique document ids for the N most relevant 
documents (i.e. set rows=N).  Also get the facet values and counts for the top 
M facets, where M is some very large number, and store the facet values and 
counts in some data structure.
3)  Run a second query which is the same as the first, but add a filter 
query for those 1000 unique ids; set rows=1 but get facet counts for the top 
30 facet values.
4)  Grab the top 5 or 30 facet values from this second query.  These are 
your most relevant facet values.
5)  Use the list of values from the previous step to retrieve the 
appropriate counts for the whole result set from the earlier stage where you 
stored the facet counts for the whole result set.
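
Here is a rough SolrJ sketch of the steps above (the field names "id" and 
"topicStr" and the URL are just placeholders for whatever your schema uses; 
error handling and query escaping are omitted):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class RelevantFacetsKludge {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // Query 1: rows=N to get the ids of the N most relevant documents; the
    // facet counts returned here are computed over the *whole* result set.
    SolrQuery q1 = new SolrQuery("jaguar");
    q1.setRows(1000);
    q1.setFacet(true);
    q1.addFacetField("topicStr");
    q1.setFacetLimit(-1);               // keep counts for all values (the "top M" store)
    QueryResponse r1 = server.query(q1);
    FacetField wholeSetCounts = r1.getFacetFields().get(0);

    // Query 2: same query, filtered to just those N ids, rows=1, and ask for
    // the top 30 facet values -- these are the "most relevant" values.
    StringBuilder fq = new StringBuilder("id:(");
    for (SolrDocument doc : r1.getResults()) {
      fq.append('"').append(doc.getFieldValue("id")).append("\" ");
    }
    fq.append(')');
    SolrQuery q2 = new SolrQuery("jaguar");
    q2.addFilterQuery(fq.toString());
    q2.setRows(1);
    q2.setFacet(true);
    q2.addFacetField("topicStr");
    q2.setFacetLimit(30);
    QueryResponse r2 = server.query(q2);

    // Display the relevant values, but with their counts from the whole result set.
    for (FacetField.Count relevant : r2.getFacetFields().get(0).getValues()) {
      for (FacetField.Count whole : wholeSetCounts.getValues()) {
        if (whole.getName().equals(relevant.getName())) {
          System.out.println(whole.getName() + " : " + whole.getCount());
        }
      }
    }
  }
}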

It would seem that this could be done much more efficiently inside 
Solr/Lucene, since instead of getting the unique ids for the N most relevant 
documents and sending those back to Solr, the code actually has access to 
bitsets containing the internal Lucene index ids which get used in the filter 
queries.  Other steps in the process could probably be streamlined as well.

Is there already some faceting code work being done along this line?
Would it make sense to open a JIRA issue for this?


Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search




RE: [jira] [Resolved] (SOLR-1844) CommonGramsQueryFilterFactory should read words in a comma-delimited format

2011-06-06 Thread Burton-West, Tom
Hi David,

Just curious about your use of the HathiTrust list.  I usually explain to 
people that it's customized to our index and that they are probably better off 
making their own list based on the lists of stop words appropriate for the 
languages in their index (sources are listed in the blog post 
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance).  
If you already have an index built and are re-indexing with CommonGrams, you 
can also use the -t flag with HighFreqTerms.java in Lucene contrib to determine 
the words that have the largest position lists and are therefore candidates to 
be added to your CommonGrams word list.  We recently ran HighFreqTerms.java 
against our indexes and discovered that it would be better to remove some of 
the less frequent foreign-language stopwords and instead use some very frequent 
words from the index.

Tom Burton-West
www.hathitrust.org/blogs

From: Steven Rowe (JIRA) [j...@apache.org]
Sent: Monday, June 06, 2011 2:08 PM
To: dev@lucene.apache.org
Subject: [jira] [Resolved] (SOLR-1844) CommonGramsQueryFilterFactory should 
read words in a comma-delimited format

 [ 
https://issues.apache.org/jira/browse/SOLR-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe resolved SOLR-1844.
---

Resolution: Won't Fix
  Assignee: Steven Rowe

Thanks David.

 CommonGramsQueryFilterFactory should read words in a comma-delimited format
 ---

 Key: SOLR-1844
 URL: https://issues.apache.org/jira/browse/SOLR-1844
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 1.4
Reporter: David Smiley
Assignee: Steven Rowe
Priority: Minor

 CommonGramsQueryFilterFactory expects that the file(s) given to the "words" 
 argument is a carriage-return delimited list of words.  It doesn't support 
 comments either.  This file format should be more flexible to support comma 
 delimited values.  I came across this because I was trying to use the sample 
 file provided by HathiTrust:
 http://www.hathitrust.org/node/180 (named in a file new400common.txt)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




RE: MergePolicy Thresholds

2011-05-20 Thread Burton-West, Tom
Hi Mike and Shai,

I was able to index a few documents with the TieredMergePolicy, but I was 
hoping to build a large test index of about 700,000 documents to compare the 
performance against our previous runs.  I was hoping I would be able to report 
on my results in time for the Lucene Revolution conference.  Unfortunately 
there was a power outage at our data center last week which resulted in a node 
failure in one of our storage nodes, and node rebalancing for a cluster of 500 
terabytes takes quite a while and totally messes up performance measurements.  
(Our 6-8 terabytes of large-scale search indexes share storage with the 
repository that holds the 480+ terabytes of page images and metadata for the 8 
million+ books.)   Hopefully I will be able to run the tests when I get back.

Tom

From: Burton-West, Tom [mailto:tburt...@umich.edu]
Sent: Monday, May 09, 2011 4:10 PM
To: dev@lucene.apache.org
Subject: RE: MergePolicy Thresholds

Thanks again Shai and Mike.

Am in the process of downloading and building   r108.  Should be able to 
build a test index sometime this week.  I'll make some guesses on what 
parameters to use based on our previous tests.

Tom
From: Shai Erera [mailto:ser...@gmail.com]
Sent: Saturday, May 07, 2011 11:33 PM
To: dev@lucene.apache.org
Subject: Re: MergePolicy Thresholds

Hey Tom,

Mike back-ported the changes to 3x, so you can try it out.

FYI,
Shai
On Tue, May 3, 2011 at 9:33 PM, Burton-West, Tom 
tburt...@umich.edu wrote:
Thanks Shai and Mike!

I'll keep an eye on LUCENE-1076.

Tom

-Original Message-
From: Michael McCandless 
[mailto:luc...@mikemccandless.com]
Sent: Tuesday, May 03, 2011 11:15 AM
To: dev@lucene.apache.org
Subject: Re: MergePolicy Thresholds
Thanks Shai!

I'm way behind on my 3.x backports -- I'll try to do this soon.

Mike

http://blog.mikemccandless.com

On Tue, May 3, 2011 at 8:10 AM, Shai Erera 
ser...@gmail.com wrote:
 I uploaded a patch to LUCENE-1076.

 Tom, apparently the patch I've attached before cannot be used, because there
 are dependencies (in earlier commits on LUCENE-1076) that need to be
 back-ported as well. So stay tuned on LUCENE-1076 for when it is safe to use
 this new MP.

 Shai

 On Tue, May 3, 2011 at 1:00 PM, Michael McCandless
 luc...@mikemccandless.com wrote:

 That'd be great, thanks :)

 Yes, let's iterate on the issue!  But: it should still be open, I hope
 (I didn't mean to close it yet, since it's not back ported)...

 Mike

 http://blog.mikemccandless.com

 On Tue, May 3, 2011 at 5:51 AM, Shai Erera 
  ser...@gmail.com wrote:
  Mike, if you want, I can back-port it, as I've already started this when
  preparing the patch.
 
  I noticed that you added a throws IOE to IW.setInfoStream -- is it ok
  on
  3x too? It'll be a backwards change.
 
  Maybe we should iterate on the issue? I can reopen.
 
  Shai
 
  On Tue, May 3, 2011 at 12:36 PM, Michael McCandless
   luc...@mikemccandless.com wrote:
 
  Looks good Shai!
 
  Comments below too:
 
  On Tue, May 3, 2011 at 5:29 AM, Shai Erera 
   ser...@gmail.com wrote:
   Hi
  
   I looked into porting it to 3x, and prepared the attached patch. It
   only
   contains the new TieredMP and Test, as well as the necessary changes
   to
   LuceneTestCase and IndexWriter. I guess you can start with it (even
   just
   the
   MP and IW changes) to test it on your indexes.
  
   Mike, I saw that there were many more changes, as part of
   LUCENE-1076,
   done
   to the code. In particular, this MP is now the default (on trunk), so
   I
   guess many changes (to tests) were needed because of that. Do you
   remember,
   if apart from the changes I've included in the patch, other important
   changes w.r.t. this code?
 
  The only other changes I can think of were some verbosity improvements
  to IndexWriter, to support the python script that can make a merge
  movie from an infoStream output; but that can wait for when I
  back-port to 3.x...
 
   As we won't change the default MP on 3x, I'm guessing I don't need to
   port
   all the changes to 3x.
 
  Right, I think.
 
  Mike
 

RE: MergePolicy Thresholds

2011-05-03 Thread Burton-West, Tom
Thanks Shai and Mike!

I'll keep an eye on LUCENE-1076.

Tom

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Tuesday, May 03, 2011 11:15 AM
To: dev@lucene.apache.org
Subject: Re: MergePolicy Thresholds

Thanks Shai!

I'm way behind on my 3.x backports -- I'll try to do this soon.

Mike

http://blog.mikemccandless.com

On Tue, May 3, 2011 at 8:10 AM, Shai Erera ser...@gmail.com wrote:
 I uploaded a patch to LUCENE-1076.

 Tom, apparently the patch I've attached before cannot be used, because there
 are dependencies (in earlier commits on LUCENE-1076) that need to be
 back-ported as well. So stay tuned on LUCENE-1076 for when it is safe to use
 this new MP.

 Shai

 On Tue, May 3, 2011 at 1:00 PM, Michael McCandless
 luc...@mikemccandless.com wrote:

 That'd be great, thanks :)

 Yes, let's iterate on the issue!  But: it should still be open, I hope
 (I didn't mean to close it yet, since it's not back ported)...

 Mike

 http://blog.mikemccandless.com

 On Tue, May 3, 2011 at 5:51 AM, Shai Erera ser...@gmail.com wrote:
  Mike, if you want, I can back-port it, as I've already started this when
  preparing the patch.
 
  I noticed that you added a throws IOE to IW.setInfoStream -- is it ok
  on
  3x too? It'll be a backwards change.
 
  Maybe we should iterate on the issue? I can reopen.
 
  Shai
 
  On Tue, May 3, 2011 at 12:36 PM, Michael McCandless
  luc...@mikemccandless.com wrote:
 
  Looks good Shai!
 
  Comments below too:
 
  On Tue, May 3, 2011 at 5:29 AM, Shai Erera ser...@gmail.com wrote:
   Hi
  
   I looked into porting it to 3x, and prepared the attached patch. It
   only
   contains the new TieredMP and Test, as well as the necessary changes
   to
   LuceneTestCase and IndexWriter. I guess you can start with it (even
   just
   the
   MP and IW changes) to test it on your indexes.
  
   Mike, I saw that there were many more changes, as part of
   LUCENE-1076,
   done
   to the code. In particular, this MP is now the default (on trunk), so
   I
   guess many changes (to tests) were needed because of that. Do you
   remember,
   if apart from the changes I've included in the patch, other important
   changes w.r.t. this code?
 
  The only other changes I can think of were some verbosity improvements
  to IndexWriter, to support the python script that can make a merge
  movie from an infoStream output; but that can wait for when I
  back-port to 3.x...
 
   As we won't change the default MP on 3x, I'm guessing I don't need to
   port
   all the changes to 3x.
 
  Right, I think.
 
  Mike
 



RE: MergePolicy Thresholds

2011-05-02 Thread Burton-West, Tom
Hi Shai and Mike,

Testing the TieredMP on our large indexes has been on my todo list since I read 
Mike's blog post 
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html.

If you port it to the 3.x branch Shai, I'll be more than happy to test it with 
our very large (300GB+) indexes.  Besides being able to set the max merged 
segment size, I'm especially interested in using the  maxSegmentsPerTier 
parameter.

From Mike's blog post:
"...maxSegmentsPerTier that lets you set the allowed width (number of 
segments) of each stair in the staircase. This is nice because it decouples how 
many segments to merge at a time from how wide the staircase can be."
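
Once it lands in 3.x, configuring it should look roughly like the sketch below 
(setter names are from memory and the sizes are just guesses for an index like 
ours, so treat it as a sketch, not a recommendation):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TieredMergePolicyDemo {
  public static void main(String[] args) throws Exception {
    TieredMergePolicy mp = new TieredMergePolicy();
    mp.setMaxMergedSegmentMB(20 * 1024);  // cap merged segment size (~20 GB, a guess)
    mp.setSegmentsPerTier(10.0);          // allowed "width" of each stair/tier
    mp.setMaxMergeAtOnce(10);             // how many segments to merge at a time

    IndexWriterConfig iwc = new IndexWriterConfig(
        Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31));
    iwc.setMergePolicy(mp);
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File(args[0])), iwc);
    writer.close();
  }
}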

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Monday, May 02, 2011 2:19 PM
To: dev@lucene.apache.org
Subject: Re: MergePolicy Thresholds

I think it should be an easy port...

Mike

http://blog.mikemccandless.com

On Mon, May 2, 2011 at 2:16 PM, Shai Erera ser...@gmail.com wrote:
 Thanks Mike. I'll take a look at TieredMP. Does it depend on trunk in any
 way, or do you think it can easily be ported to 3x?
 Shai






RE: Link to nightly build test reports on main Lucene site needs updating

2011-05-02 Thread Burton-West, Tom
Thanks for fixing++

Tom

-Original Message-
From: Uwe Schindler [mailto:u...@thetaphi.de] 
Sent: Sunday, May 01, 2011 6:05 AM
To: dev@lucene.apache.org; simon.willna...@gmail.com; 
java-u...@lucene.apache.org
Subject: RE: Link to nightly build test reports on main Lucene site needs 
updating

I fixed the nightly docs; once the webserver mirrors them from SVN they should 
appear. The developer-resources page was completely broken. It now also 
contains references to the stable 3.x branch, as most users would prefer that 
one to get the latest bug fixes without a backwards-incompatible version.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de




RE: Using contrib Lucene Benchmark with Solr

2011-03-31 Thread Burton-West, Tom
Thanks Robert and Grant,

Does this need a separate JIRA issue dealing specifically with the ability of 
the benchmark code to read Solr config settings, or is it subsumed in 
LUCENE-2845?  Or should I just add a comment to LUCENE-2845?

Tom
-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Wednesday, March 30, 2011 7:56 PM
To: dev@lucene.apache.org
Subject: Re: Using contrib Lucene Benchmark with Solr

On Wed, Mar 30, 2011 at 4:49 PM, Burton-West, Tom tburt...@umich.edu wrote:
 I would like to be able to use the Lucene Benchmark code with Solr to run
 some indexing tests.  It would be nice if Lucene Benchmark to could read
 Solr configuration rather than having to translate my filter chain and other
 parameters into Lucene.   Would it be appropriate to open a JIRA issue for
 this or is this something that doesn’t really make any sense?


I think it makes great sense; we moved the benchmarking facility to a
top-level module so we can do this:
https://issues.apache.org/jira/browse/LUCENE-2845, but we didn't
actually add any integration yet.

I've been in this exact same situation too when trying to use the
benchmark package, and I'd sure like to see better solr integration
with the benchmarking package myself.




Using contrib Lucene Benchmark with Solr

2011-03-30 Thread Burton-West, Tom
I would like to be able to use the Lucene Benchmark code with Solr to run some 
indexing tests.  It would be nice if Lucene Benchmark could read Solr 
configuration rather than having to translate my filter chain and other 
parameters into Lucene.   Would it be appropriate to open a JIRA issue for this, 
or is this something that doesn't really make any sense?

Tom



RE: Is it possible to set the merge policy setMaxMergeMB from Solr

2010-12-17 Thread Burton-West, Tom
I'm a bit confused.  

There are some examples in the JIRA issue for SOLR-1447, but I can't tell from 
reading it what the final allowed syntax is.

I see 

<!--<mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">-->
  <!--<double name="maxMergeMB">64.0</double>-->
<!--</mergePolicy>-->
in the JIRA issue and in what I think is the test case config file:
http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/solr/src/test/test-files/solr/conf/solrconfig-propinject.xml?view=log

Lance's example is 

<mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy
<maxMergeMB>1024</maxMergeMB>
</mergePolicy>

Which one is correct?

Tom

-Original Message-
From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] 
Sent: Tuesday, December 07, 2010 10:48 AM
To: dev@lucene.apache.org
Subject: Re: Is it possible to set the merge policy setMaxMergeMB from Solr

SOLR-1447 added this functionality.

On Mon, Dec 6, 2010 at 2:34 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Lucene has this method to set the maximum size of a segment when merging:
 LogByteSizeMergePolicy.setMaxMergeMB
 (http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/index/LogByteSizeMergePolicy.html#setMaxMergeMB%28double%29
 )

 I would like to be able to set this in my solrconfig.xml.  Is this
 possible?  If not should I open a JIRA issue or is there some gotcha I am
 unaware of?

 Tom

 Tom Burton-West





Is it possible to set the merge policy setMaxMergeMB from Solr

2010-12-06 Thread Burton-West, Tom
Lucene has this method to set the maximum size of a segment when merging: 
LogByteSizeMergePolicy.setMaxMergeMB   
(http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/index/LogByteSizeMergePolicy.html#setMaxMergeMB%28double%29
 )
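
At the Lucene level the call is straightforward -- something like the sketch 
below for 3.0.x (where, if I remember right, the merge policy constructor takes 
the IndexWriter; treat the details as unverified):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class MaxMergeMBDemo {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File(args[0])),
        new StandardAnalyzer(Version.LUCENE_30),
        IndexWriter.MaxFieldLength.UNLIMITED);
    LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy(writer);
    mp.setMaxMergeMB(5120.0);     // don't merge segments larger than ~5 GB
    writer.setMergePolicy(mp);
    // ... add documents; segments over the limit are no longer merged together ...
    writer.close();
  }
}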

I would like to be able to set this in my solrconfig.xml.  Is this possible?  
If not, should I open a JIRA issue, or is there some gotcha I am unaware of?

Tom

Tom Burton-West



Solr 1.4.1 Analysis console gives error regarding CharTermAttributeImpl that is not in the target

2010-11-11 Thread Burton-West, Tom
Hello all,

I am using Solr 1.4.1 and a custom filter that worked with a previous version 
of Solr that used Lucene 2.9.  When I try to use the analysis console I get 
this error message:

  java.lang.IllegalArgumentException: This AttributeSource contains 
AttributeImpl of type 
org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl that is not in 
the target
(See below for stack trace that shows this is an interaction of the custom 
punctuation filter and the Analysis jsp)

I believe this has to do with this JIRA issue: 
https://issues.apache.org/jira/browse/LUCENE-2302

I looked at the most recent org.apache.lucene.analysis package document 
http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/analysis/package.html?view=co
  but didn't see a mention of CharTermAttributeImpl

Can someone point me to the documentation or example code that might explain 
the issue?

Tom Burton-West
Stack trace excerpt:

Caused by: java.lang.IllegalArgumentException: This AttributeSource contains 
AttributeImpl of type 
org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl that is not in 
the target
at 
org.apache.lucene.util.AttributeSource.copyTo(AttributeSource.java:493)
at 
org.apache.jsp.admin.analysis_jsp$1.incrementToken(org.apache.jsp.admin.analysis_jsp:102)
at 
org.apache.solr.analysis.PunctuationFilter.incrementToken(PunctuationFilter.java:40)
at 
org.apache.jsp.admin.analysis_jsp.getTokens(org.apache.jsp.admin.analysis_jsp:131)
at 
org.apache.jsp.admin.analysis_jsp.doAnalyzer(org.apache.jsp.admin.analysis_jsp:110)
at 
org.apache.jsp.admin.analysis_jsp._jspService(org.apache.jsp.admin.analysis_jsp:718)




RE: Solr 1.4.1 Analysis console gives error regarding CharTermAttributeImpl that is not in the target

2010-11-11 Thread Burton-West, Tom
 Something here is using lucene 3.x or trunk code, since
CharTermAttribute[Impl] only exists in unreleased versions!

Doh!   I forgot to switch my binaries back to Solr 1.4.1 from 3.x.  Thanks for 
the catch, Robert.  The subject line should read: "Solr/Lucene 3.x Analysis 
console gives error regarding CharTermAttributeImpl that is not in the target".

I do need to port my filter to lucene 3.x, so is there 3.x documentation about 
use of CharTermAttributeImpl?

Is this something that needs to be in the TokenStream examples in the 3.0.2 
org.apache.lucene.analysis package.html?
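
(For anyone making a similar port, my understanding is that the new-API version 
of a filter looks roughly like the sketch below -- a made-up punctuation-stripping 
filter for illustration, not our actual one:)

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class PunctuationStrippingFilterSketch extends TokenFilter {
  // replaces the old TermAttribute; gives direct access to the term's char buffer
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public PunctuationStrippingFilterSketch(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // remove punctuation characters from the term in place
    char[] buf = termAtt.buffer();
    int len = termAtt.length();
    int out = 0;
    for (int i = 0; i < len; i++) {
      if (!isPunctuation(buf[i])) {
        buf[out++] = buf[i];
      }
    }
    termAtt.setLength(out);
    return true;
  }

  // what counts as "punctuation" here is purely for illustration
  private static boolean isPunctuation(char c) {
    return c == '.' || c == ',' || c == ';' || c == ':' || c == '!' || c == '?';
  }
}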

Tom


RE: Solr 1.4.1 Analysis console gives error regarding CharTermAttributeImpl that is not in the target

2010-11-11 Thread Burton-West, Tom
Ok, I was using a recent unreleased version of Solr/Lucene but looking at the 
Lucene 3.0.2 docs instead of the nightly build docs.  Found the answer I needed 
in the nightly build docs.
https://hudson.apache.org/hudson/job/Lucene-trunk/javadoc//core/org/apache/lucene/analysis/package-summary.html

Tom


-Original Message-
From: Burton-West, Tom [mailto:tburt...@umich.edu] 
Sent: Thursday, November 11, 2010 1:26 PM
To: dev@lucene.apache.org
Subject: RE: Solr 1.4.1 Analysis console gives error regarding 
CharTermAttributeImpl that is not in the target

 Something here is using lucene 3.x or trunk code, since
CharTermAttribute[Impl] only exists in unreleased versions!

Doh!   I forgot to switch my binaries back to Solr 1.4.1 from 3.x.  Thanks for 
the catch Robert.  The subject line should read: Solr/Lucene 3.x Analysis 
console gives error regarding CharTermAttributeImpl that is not in the target

I do need to port my filter to lucene 3.x, so is there 3.x documentation about 
use of CharTermAttributeImpl?

Is this something that needs to be in the TokenStream examples in the 3.0.2 
org.apache.lucene.analysis package.html?

Tom


RE: Antw.: Solr 1.4.1 Analysis console gives error regarding CharTermAttributeImpl that is not in the target

2010-11-11 Thread Burton-West, Tom
Thanks Uwe,

A bug in analysis.jsp is consistent with what I am seeing.  I can run 
explain/debug queries using my filter in the Solr/Lucene 3.x version and it’s 
clearly working.  However I get the error when I try the analysis console.  Is 
this the same issue as SOLR-2051?

Tom

From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent: Thursday, November 11, 2010 1:49 PM
To: Burton-West, Tom; dev@lucene.apache.org
Subject: Antw.: Solr 1.4.1 Analysis console gives error regarding 
CharTermAttributeImpl that is not in the target

I still think this is a bug in analysis.jsp. copyTo does not work here 
correctly because it tries to copy a TA to a CTA. Seems that analysis.jsp does 
generate the target AttributeSource incorrectly.

I will look into this.
---
Uwe Schindler
Generics Policeman
Bremen, Germany

- Reply message -
From: Burton-West, Tom tburt...@umich.edu
Date: Thu., Nov. 11, 2010 19:03
Subject: Solr 1.4.1 Analysis console gives error regarding 
CharTermAttributeImpl that is not in the target
To: dev@lucene.apache.org dev@lucene.apache.org

Hello all,

I am using Solr 1.4.1 and a custom filter that worked with a previous version 
of Solr that used Lucene 2.9.  When I try to use the analysis console I get 
this error message:

 java.lang.IllegalArgumentException: This AttributeSource contains 
AttributeImpl of type 
org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl that is not in 
the target
(See below for stack trace that shows this is an interaction of the custom 
punctuation filter and the Analysis jsp)

I believe this has to do with this JIRA issue: 
https://issues.apache.org/jira/browse/LUCENE-2302

I looked at the most recent org.apache.lucene.analysis package document 
http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/analysis/package.html?view=co
  but didn't see a mention of CharTermAttributeImpl

Can someone point me to the documentation or example code that might explain 
the issue?

Tom Burton-West
Stack trace excerpt:

Caused by: java.lang.IllegalArgumentException: This AttributeSource contains 
AttributeImpl of type 
org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl that is not in 
the target
   at 
org.apache.lucene.util.AttributeSource.copyTo(AttributeSource.java:493)
   at 
org.apache.jsp.admin.analysis_jsp$1.incrementToken(org.apache.jsp.admin.analysis_jsp:102)
   at 
org.apache.solr.analysis.PunctuationFilter.incrementToken(PunctuationFilter.java:40)
   at 
org.apache.jsp.admin.analysis_jsp.getTokens(org.apache.jsp.admin.analysis_jsp:131)
   at 
org.apache.jsp.admin.analysis_jsp.doAnalyzer(org.apache.jsp.admin.analysis_jsp:110)
   at 
org.apache.jsp.admin.analysis_jsp._jspService(org.apache.jsp.admin.analysis_jsp:718)





RE: Antw.: Solr 1.4.1 Analysis console gives error regarding CharTermAttributeImpl that is not in the target

2010-11-11 Thread Burton-West, Tom
Sorry about the confusion (my confusion mostly:).   I was actually using 
revision 1030032 of Lucene/Solr (see below)
with a custom token filter that does not use CharTermAttribute.  I'll recompile 
the custom filter against this revision and verify that the analysis.jsp 
produces the same results in a few minutes.

Solr Specification Version: 3.0.0.2010.11.03.16.59.02
Solr Implementation Version: 3.1-SNAPSHOT 1030032 - tburtonw - 
2010-11-03 16:59:02
Lucene Specification Version: 3.1-SNAPSHOT
Lucene Implementation Version: 3.1-SNAPSHOT 1030032 - 2010-11-03 
17:00:44

Tom
-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Thursday, November 11, 2010 1:54 PM
To: dev@lucene.apache.org
Subject: Re: Antw.: Solr 1.4.1 Analysis console gives error regarding 
CharTermAttributeImpl that is not in the target

On Thu, Nov 11, 2010 at 1:49 PM, Uwe Schindler u...@thetaphi.de wrote:
 I still think this is a bug in analysis.jsp. Copyto does not work here
 correctly because it tries to copy a ta to cta.seems that analysis.hap does
 generate the Target attributesource incorrect.

 I will look into this.

I think (perhaps I am mistaken) that Tom somehow mixed up some newer
binaries with Solr 1.4.1/Lucene 2.9.

Tom, am I mistaken? Your message says you are using Solr 1.4.1; that's what's
confusing me.

Did you actually receive this error on branch_3x Solr's analysis.jsp
with an old TermAttribute-using TokenFilter?




RE: Antw.: Solr 1.4.1 Analysis console gives error regarding CharTermAttributeImpl that is not in the target

2010-11-11 Thread Burton-West, Tom
Thanks Uwe,

I recompiled my filter against revision 1030032 of Lucene/Solr and confirmed 
the same behavior (error message about "CharTermAttributeImpl that is not in 
the target").  Then I applied your patch and recompiled Lucene/Solr.  Your patch 
fixes the problem.  Analysis.jsp now works fine with my filter.

Opened issue SOLR-2234.

Tom

-Original Message-
From: Uwe Schindler [mailto:u...@thetaphi.de] 
Sent: Thursday, November 11, 2010 2:49 PM
To: dev@lucene.apache.org
Subject: RE: Antw.: Solr 1.4.1 Analysis console gives error regarding 
CharTermAttributeImpl that is not in the target

Hi Tom,

Can you try the attached LuSolr patch? This is a problem of the backwards layer 
for CTA/TA coexistence. This is a hack, but it ensures that both attributes 
always use the same implementation class.

If this fixes your bug, can you open an issue for 3.x and I will commit?

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Robert Muir [mailto:rcm...@gmail.com]
 Sent: Thursday, November 11, 2010 8:20 PM
 To: dev@lucene.apache.org
 Subject: Re: Antw.: Solr 1.4.1 Analysis console gives error regarding 
 CharTermAttributeImpl that is not in the target
 
 On Thu, Nov 11, 2010 at 2:05 PM, Burton-West, Tom tburt...@umich.edu
 wrote:
  Sorry about the confusion (my confusion mostly:).   I was actually
  using revision 1030032 of Lucene/Solr (see below) with a custom 
  token filter
 that does not use CharTermAttribute.  I'll recompile the custom filter 
 against this revision and verify that the analysis.jsp produces the 
 same results in a few minutes.
 
 
 Thanks Tom, this sounds like a good catch then. From your previous 
 reply, I do think some of the issues discussed in SOLR-2051 could be related.
 
 As I mentioned there, this analysis.jsp is not well-behaved: it crosses 
 the tokenstreams, and really I think Uwe's comment at
 http://s.apache.org/n5 describes the proper solution, where it then is 
 a well-behaved, more accurate representation of what is going on with 
 analysis.
 
 I think this is why you probably don't have any other problems with 
 your filter, except in this analysis.jsp.
 
 But, it would still be good to check that it's not a general bug in 
 AttributeSource.copyTo, because if so, someone will hit this problem 
 with SynonymFilter combined with an old TokenStream.
 



RE: Flex indexing: Hybrid index maintenance for faster indexing

2010-10-05 Thread Burton-West, Tom
Thanks Mike,

I suspected the approach might require architectural changes beyond flex, but 
since our indexes are so huge and disk I/O is our main bottleneck both for 
searching and indexing, I'm always looking for ways to deal with very large 
postings and positions lists that might reduce I/O.

I haven't looked in detail into PFOR and Simple9 and some of the other new 
encodings, but my understanding is that they trade off compression for 
decompression speed, i.e. they take up a bit more space but are more efficient 
to decompress.   In our case, where we have underutilized CPU, mostly because 
the processors are waiting on disk I/O, I'll be curious to find out whether the 
slight increase in disk I/O time due to lower compression is still outweighed 
by the increase in decompression speed. (Don't know if we'll find the time to 
try flex for a while though:)


BTW: have you seen this paper looking at 64-bit words?
"Index Compression Using 64-Bit Words", Anh &amp; Moffat. Software -- Practice &amp; 
Experience, 40(2):131-148, February 2010.


Tom 
-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Tuesday, October 05, 2010 6:21 AM
To: dev@lucene.apache.org
Subject: Re: Flex indexing: Hybrid index maintenance for faster indexing

Nice paper!

It's a neat trick to index the large postings as separate files, i.e.
let the filesystem handle the growth as new postings are appended
over time.

But, unfortunately, we can't easily do this in Lucene, since Lucene
assumes index files are write-once, and derives its transactional
semantics from this approach.  Ie, this would require sizable changes,
beyond just swapping in a different Codec.

Still, the idea that small/big postings lists should be handled
differently is something we can take advantage of in a Codec, and I
think we should.  I think likely we will switch to a default codec
that uses pulsing (storing a term's postings directly in the terms dict) for
very low freq terms, maybe vInt for medium freq terms, and FOR/PFOR
for high freq terms.

Mike

On Mon, Oct 4, 2010 at 6:42 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Hi all,

 Would it be possible to implement something like this in Flex?


 Büttcher, S., &amp; Clarke, C. L. A. (2008). Hybrid index maintenance for 
 contiguous inverted lists. Information Retrieval, 11(3), 175-207. 
 doi:10.1007/s10791-007-9042-8

 The approach takes advantage of having a different policy for large postings 
 lists (ie frequent terms)  versus small postings lists for flushing the 
 buffer and writing to disk.


 Tom Burton-West




Flex indexing: Hybrid index maintenance for faster indexing

2010-10-04 Thread Burton-West, Tom
Hi all,

Would it be possible to implement something like this in Flex?


Büttcher, S., &amp; Clarke, C. L. A. (2008). Hybrid index maintenance for 
contiguous inverted lists. Information Retrieval, 11(3), 175-207. 
doi:10.1007/s10791-007-9042-8

The approach takes advantage of having a different policy for large postings 
lists (ie frequent terms)  versus small postings lists for flushing the buffer 
and writing to disk.


Tom Burton-West




Merge policy to merge during off-peak hours

2010-07-12 Thread Burton-West, Tom
Hello all,

Lucene in Action, 2nd Edition, mentions a time-dependent merge policy that 
defers large merges until off-peak hours (Section 2.13.6, p. 71).
Has anyone implemented such a policy?  Is it worth opening a JIRA issue for 
this?

Tom Burton-West
www.hathitrust.org/blogs



RE: Benchmarking Solr indexing using Lucene Benchmark?

2010-06-15 Thread Burton-West, Tom
Thanks Jason,

I'll take a look at how much work is involved, and if getting it to work with 
the Solr config looks reasonably doable (in the time I have available), I'll 
give it a try and report back.  Do you think it's worth opening a JIRA issue?

Tom

-Original Message-
From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] 
Sent: Monday, June 14, 2010 12:02 PM
To: dev@lucene.apache.org
Subject: Re: Benchmarking Solr indexing using Lucene Benchmark?

Tom,

This was discussed a while back, however I don't believe
anything was committed. I think there's a fair bit of work
involved in that the Lucene benchmark config would not be
usable, or rather, it would need to simply point to a Solr
solrconfig.xml file. Other than that, the resulting statistical
reporting should be useful.

Jason

On Mon, Jun 14, 2010 at 8:57 AM, Burton-West, Tom tburt...@umich.edu wrote:
 Hi all,

 Posted this to the Solr users list and after a week with no responses,
 thought I would try the dev list.

 We are about to test out various factors to try to speed up our indexing
 process.  One set of experiments will try various maxRamBufferSizeMB
  settings.   Since the factors we will be varying are at the Lucene level,
 we are considering using the Lucene Benchmark utilities in Lucene/contrib.
 Have other Solr users used Lucene Benchmark?  Can anyone provide any hints
 for adapting it to Solr? (Are there any common gotchas etc?).

 Tom

 Tom Burton-West
 University of Michigan Libraries
 http://www.hathitrust.org/blogs/large-scale-search





Benchmarking Solr indexing using Lucene Benchmark?

2010-06-14 Thread Burton-West, Tom
Hi all,

Posted this to the Solr users list and after a week with no responses, thought 
I would try the dev list.

We are about to test out various factors to try to speed up our indexing 
process.  One set of experiments will try various maxRamBufferSizeMB settings. 
Since the factors we will be varying are at the Lucene level, we are 
considering using the Lucene Benchmark utilities in Lucene/contrib.  Have 
other Solr users used Lucene Benchmark?  Can anyone provide any hints for 
adapting it to Solr? (Are there any common gotchas etc.?)

Tom

Tom Burton-West
University of Michigan Libraries
http://www.hathitrust.org/blogs/large-scale-search



questions about DocsEnum.read()in flex api

2010-04-30 Thread Burton-West, Tom
I'm a bit confused about the DocsEnum.read() in the flex API.   I have three 
questions:

1)  DocsEnum.read() currently delegates to nextDoc() in the base class and 
there is a note that subclasses may do this more efficiently.  Is there 
currently a more efficient implementation in a subclass?  I didn't see one in 
MultiDocsEnum or MappingMultiDocsEnum, but perhaps I'm not understanding the 
code.

2)  DocsEnum.read() reads 64 docs/freqs at a time, as set up in 
initBulkResult().  Would it make sense to have this configurable as an argument 
somewhere?   I'm looking at very large indexes where a common term might occur 
in 100,000 or more docs.

3)  At the very top of the JavaDoc there is a warning that "you must first call 
nextDoc".   It seems that this applies to calling DocsEnum.docID() or 
DocsEnum.freq() but not to DocsEnum.read().  Is that correct?

Tom Burton-West




RE: questions about DocsEnum.read()in flex api

2010-04-30 Thread Burton-West, Tom
Thanks Mike!

A follow-up question: 

 DocsEnum.read() currently delegates to nextDoc() in the base class and there
 is a note that subclasses may do this more efficiently.  Is there currently
 a more efficient implementation in a subclass?  
Yes, the standard codec does so (StandardPostingsReaderImpl.java).

I assume that the standard codec is the default.  Will what I'm using in 
HighFreqTermsWithTF to instantiate an IndexReader (below) eventually end up 
instantiating the StandardPostingsReaderImpl, or do I need to do something 
explicitly that will cause it to be instantiated?
 
dir = FSDirectory.open(new File(args[0]));
reader = IndexReader.open(dir, true); 

Tom




RE: Fix to contrib/misc/HighFreqTerms.java

2010-04-16 Thread Burton-West, Tom
Hi Mike,

Thanks for making the fix and changing the display from bytes to utf8.  It 
needs a very minor change:
The latest fix converts to utf8 if you give a field argument on the command 
line but still shows bytes if you don't.

Line 89 should parallel line 70 and use term.utf8ToString() instead of 
term.toString():

70   tiq.insertWithOverflow(new TermInfo(new Term(field, 
term.utf8ToString()), termsEnum.docFreq()));
89   tiq.insertWithOverflow(new TermInfo(new Term(field, term.toString()), 
terms.docFreq()));

Tom

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Wednesday, April 14, 2010 3:50 PM
To: java-dev@lucene.apache.org
Subject: Re: Bug in contrib/misc/HighFreqTerms.java?

OK I committed the fix.  I ran it on a flex wikipedia index I had...
it produces output like this:

body:[3c 21 2d 2d] 509050
body:[73 68 6f 75 6c 64] 515495
body:[74 68 65 6e] 525176
body:[74 69 74 6c 65] 525361
body:[5b 5b 55 6e 69 74 65 64] 532586
body:[6b 6e 6f 77 6e] 533558
body:[75 6e 64 65 72] 536480
body:[55 6e 69 74 65 64] 543746

Which is not very readable, but, it does this because flex terms are
arbitrary byte[], not necessarily utf8... maybe we should fix it to
print both hex and String if we assume bytes are utf8?

Mike

On Wed, Apr 14, 2010 at 3:25 PM, Michael McCandless
luc...@mikemccandless.com wrote:
 Ugh, I'll fix this.

 With the new flex API, you can't ask a composite (Multi/DirReader) for
 its postings -- you have to go through the static methods on
 MultiFields.  I'm trying to put some distance b/w IndexReader and
 composite readers... because I'd like to eventually deprecate them.
 Ie, the composite readers should hold an ordered collection of
 sub-readers, but should not themselves implement IndexReader's API, I
 think.

 Thanks for raising this Tom,

 Mike

 On Wed, Apr 14, 2010 at 2:14 PM, Burton-West, Tom tburt...@umich.edu wrote:
 When I try to run HighFreqTerms.java in Lucene Revision: 933722  I get the
 the exception appended below.  I believe the line of code involved is a
 result of the flex indexing merge. Should I post this as a comment to
 LUCENE-2370 (Reintegrate flex branch into trunk)?

 Or is there simply something wrong with my configuration?

 Exception in thread main java.lang.UnsupportedOperationException: please
 use MultiFields.getFields if you really need a top level Fields (NOTE that
 it's usually better to work per segment instead)
     at
 org.apache.lucene.index.DirectoryReader.fields(DirectoryReader.java:762)
     at org.apache.lucene.misc.HighFreqTerms.main(HighFreqTerms.java:71)

 Tom Burton-West






Bug in contrib/misc/HighFreqTerms.java?

2010-04-14 Thread Burton-West, Tom
When I try to run HighFreqTerms.java in Lucene revision 933722, I get the 
exception appended below.  I believe the line of code involved is a result of 
the flex indexing merge. Should I post this as a comment to LUCENE-2370 
(Reintegrate flex branch into trunk)?

Or is there simply something wrong with my configuration?

Exception in thread "main" java.lang.UnsupportedOperationException: please use 
MultiFields.getFields if you really need a top level Fields (NOTE that it's 
usually better to work per segment instead)
at 
org.apache.lucene.index.DirectoryReader.fields(DirectoryReader.java:762)
at org.apache.lucene.misc.HighFreqTerms.main(HighFreqTerms.java:71)

Tom Burton-West



Solr BufferedTokenStream and new Lucene 2.9 TokenStream API

2009-07-24 Thread Burton-West, Tom
Hello all,

Would it be appropriate to open a JIRA issue to get converting the Solr 
BufferedTokenStream class to use the new Lucene 2.9 token API on the todo list?  
Alternatively, is there a more general issue already open regarding Solr 
filters and the new API?  (I couldn't find one.)  Or is it better to wait until 
the Lucene 2.9 API becomes final 
(https://issues.apache.org/jira/browse/LUCENE-1693) before opening a JIRA issue?

Tom Burton-West



How to contribute question (patch against release or latest trunk?)

2009-06-19 Thread Burton-West, Tom
Hello,

I read the How to Contribute page on the wiki and want to make a patch.  Do I 
make the patch against the latest Solr trunk or against the last release?

Tom


Tests fail for solrj.embedded on windows (Release 78676 and 775664 )

2009-06-19 Thread Burton-West, Tom
Hello all,

About every other time I check out a current version of trunk and run the 
tests, the tests for solrj.embedded.* fail.  I'm running under Windows XP with 
java version 1.6.0_13
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)

With the latest release  786676, I get these two failure messages:

[junit] Running org.apache.solr.client.solrj.embedded.MergeIndexesEmbeddedTest
 [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 1.188 sec
 [junit] Test org.apache.solr.client.solrj.embedded.MergeIndexesEmbeddedTest 
FAILED
 [junit] Running org.apache.solr.client.solrj.embedded.MultiCoreEmbeddedTest
 [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 1.187 sec
 [junit] Test org.apache.solr.client.solrj.embedded.MultiCoreEmbeddedTest FAILED

I previously had failures with release 775664 for 
org.apache.solr.client.solrj.embedded.SolrExampleStreamingTest 
with a slightly earlier version of the JDK:

http://issues.apache.org/jira/browse/SOLR-1014?focusedCommentId=12710502#action_12710502

Is there some magic setting, environment variable or junit version that I am 
missing?
What is the recommended workaround?

Tom Burton-West



How to Contribute question

2009-04-21 Thread Burton-West, Tom
Hello,

I read the How to Contribute document on the wiki. 
(http://wiki.apache.org/solr/HowToContribute#head-385f123f540367646df16825ca043d0098b31365)

I have written a custom analyzer https://issues.apache.org/jira/browse/SOLR-908
and would like to create a patch as documented in the wiki.

My question is: where should I put my files in the source tree to generate the 
patch?  Should they go in trunk/contribute/mycode, or in 
src/java/org/apache/solr/analysis and src/test/org/apache/solr/analysis?

Tom Burton-West
tburt...@umich.edu