Re: issue with automatic iterable detection?
Indeed, this is why I put that assertion there :-) It's a bit of guesswork what all the possibilities are there. I'll add support for arrays there. Andi..

On Thu, 3 Mar 2011, Bill Janssen wrote:

This looks like a problem. This is with an svn checkout of branch_3x. Bill

Traceback (most recent call last):
  File "/usr/lib/python2.6/runpy.py", line 122, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.6/runpy.py", line 34, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/__main__.py", line 98, in <module>
    cpp.jcc(sys.argv)
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py", line 548, in jcc
    addRequiredTypes(cls, typeset, generics)
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py", line 233, in addRequiredTypes
    addRequiredTypes(cls, typeset, True)
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py", line 238, in addRequiredTypes
    addRequiredTypes(ta, typeset, True)
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py", line 240, in addRequiredTypes
    raise NotImplementedError, repr(cls)
NotImplementedError: Type: double[]
Re: issue with automatic iterable detection?
On Thu, 3 Mar 2011, Bill Janssen wrote:

Did a fresh checkout and here's the next issue. This one may be harder to fix...

No, it's just another one of these Type classes, WildcardType. I should have a fix shortly. Sorry for the mess. Andi..

Bill

Traceback (most recent call last):
  File "/usr/lib/python2.6/runpy.py", line 122, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.6/runpy.py", line 34, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/__main__.py", line 98, in <module>
    cpp.jcc(sys.argv)
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py", line 551, in jcc
    addRequiredTypes(cls, typeset, generics)
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py", line 233, in addRequiredTypes
    addRequiredTypes(cls, typeset, True)
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py", line 238, in addRequiredTypes
    addRequiredTypes(ta, typeset, True)
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py", line 238, in addRequiredTypes
    addRequiredTypes(ta, typeset, True)
  File "/usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py", line 243, in addRequiredTypes
    raise NotImplementedError, repr(cls)
NotImplementedError: Type: ?
Re: issue with automatic iterable detection?
On Thu, 3 Mar 2011, Bill Janssen wrote:

Andi Vajda <va...@apache.org> wrote: Bill, did that solve your problem?

Hmmm, I'm still seeing it. And some other stuff:

Could you please send me the Java code that triggers this? Andi..

build/_GoodStuff/__wrap03__.cpp: In function 'PyObject* com::parc::goodstuff::relations::t_Something1SeqIterator_nextElement(com::parc::goodstuff::relations::t_Something1SeqIterator*, PyObject*)':
build/_GoodStuff/__wrap03__.cpp:9122: error: 'class com::parc::goodstuff::relations::t_Something1SeqIterator' has no member named 'parameters'
build/_GoodStuff/__wrap03__.cpp:9122: error: 'class com::parc::goodstuff::relations::t_Something1SeqIterator' has no member named 'parameters'
build/_GoodStuff/__wrap03__.cpp: In function 'PyObject* com::parc::goodstuff::family::t_Something2Iterator_nextElement(com::parc::goodstuff::family::t_Something2Iterator*, PyObject*)':
build/_GoodStuff/__wrap03__.cpp:15376: error: 'class com::parc::goodstuff::family::t_Something2Iterator' has no member named 'parameters'
build/_GoodStuff/__wrap03__.cpp:15376: error: 'class com::parc::goodstuff::family::t_Something2Iterator' has no member named 'parameters'
build/_GoodStuff/__wrap03__.cpp: At global scope:
build/_GoodStuff/__wrap03__.cpp:27749: error: 't_JArray' was not declared in this scope
build/_GoodStuff/__wrap03__.cpp:27749: error: parse error in template argument list
build/_GoodStuff/__wrap03__.cpp:27749: error: insufficient contextual information to determine type
build/_GoodStuff/__wrap03__.cpp:27749: warning: '>>' operator will be treated as two right angle brackets in C++0x
build/_GoodStuff/__wrap03__.cpp:27749: warning: suggest parentheses around '>>' expression
build/_GoodStuff/__wrap03__.cpp:27749: error: spurious '>>', use '>' to terminate a template argument list
build/_GoodStuff/__wrap03__.cpp:27749: error: expected primary-expression before ')' token
build/_GoodStuff/__wrap03__.cpp:27749: error: too many initializers for 'PyTypeObject'
build/_GoodStuff/__wrap03__.cpp:41430: error: 't_JArray' was not declared in this scope
build/_GoodStuff/__wrap03__.cpp:41430: error: parse error in template argument list
build/_GoodStuff/__wrap03__.cpp:41430: error: insufficient contextual information to determine type
build/_GoodStuff/__wrap03__.cpp:41430: error: expected primary-expression before ')' token
build/_GoodStuff/__wrap03__.cpp:41430: error: too many initializers for 'PyTypeObject'
error: command 'gcc' failed with exit status 1

I think when I tried it this afternoon (I was running out the door and kind of rushed) I just did a wrap, and not a --build. Sorry about that. Bill

Andi..

On Feb 28, 2011, at 20:05, Andi Vajda <va...@apache.org> wrote:

On Sun, 27 Feb 2011, Bill Janssen wrote:

Andi Vajda <va...@apache.org> wrote: It may be simplest if you can send me the source file for this class as well as a small jar file I can use to reproduce this?

Turns out to be simple to reproduce. Put the attached in a file called test.java, and run this sequence:

% javac -classpath . test.java
% jar cf test.jar *.class
% python -m jcc.__main__ --python test --shared --jar /tmp/test.jar --build --vmarg -Djava.awt.headless=true

This was a tougher one. It was triggered by a combination of things:
- no wrapper requested for java.io.File or --package java.io
- a subclass of a parameterized class, or interface implementor of a parameterized interface, wasn't pulling in classes used as type parameters (java.io.File here).

A fix is checked into jcc trunk/branch_3x rev 1075642.
This also includes the earlier fix about using absolute class names. Andi..
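For illustration, a class along these lines would exercise that combination: an implementor of a parameterized interface whose type parameter (java.io.File) has no wrapper requested. This is a hedged sketch, not the actual test.java attachment (which isn't preserved here); the class and member names are invented.

    import java.io.File;
    import java.util.Arrays;
    import java.util.Iterator;

    // Hypothetical reproduction: the iterable's type parameter is a class
    // (java.io.File) that jcc was not asked to wrap.
    public class test implements Iterable<File> {
        private final File[] files;

        public test(File[] files) {
            this.files = files;
        }

        public Iterator<File> iterator() {
            return Arrays.asList(files).iterator();
        }
    }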
Using JCC / PyLucene with JEPP?
New topic. I'd like to wrap my UpLib codebase, which is Python using PyLucene, in Java using JEPP (http://jepp.sourceforge.net/), so that I can use it with Tomcat. Now, am I going to have to do some trickery to get a VM? Or will getVMEnv() just work with a previously initialized JVM? Bill
Re: issue with automatic iterable detection?
Here's one of the generated lines that's causing me grief.

DECLARE_TYPE(RankIterator, t_RankIterator, ::java::lang::Object, RankIterator, t_RankIterator_init_, PyObject_SelfIter, ((PyObject *(*)(t_RankIterator *)) get_next<t_RankIterator,t_JArray<jint>,JArray<jint>>), t_RankIterator__fields_, 0, 0);

It yields this:

build/_PPD/__wrap02__.cpp:27284: error: 't_JArray' was not declared in this scope
build/_PPD/__wrap02__.cpp:27284: error: parse error in template argument list
build/_PPD/__wrap02__.cpp:27284: error: insufficient contextual information to determine type
build/_PPD/__wrap02__.cpp:27284: warning: '>>' operator will be treated as two right angle brackets in C++0x
build/_PPD/__wrap02__.cpp:27284: warning: suggest parentheses around '>>' expression
build/_PPD/__wrap02__.cpp:27284: error: spurious '>>', use '>' to terminate a template argument list
build/_PPD/__wrap02__.cpp:27284: error: expected primary-expression before ')' token
build/_PPD/__wrap02__.cpp:27284: error: too many initializers for 'PyTypeObject'

Where does t_JArray get defined? I can't find it. Bill
Re: issue with automatic iterable detection?
On Thu, 3 Mar 2011, Andi Vajda wrote:

On Mar 3, 2011, at 22:09, Bill Janssen <jans...@parc.com> wrote:

Here's one of the generated lines that's causing me grief.

DECLARE_TYPE(RankIterator, t_RankIterator, ::java::lang::Object, RankIterator, t_RankIterator_init_, PyObject_SelfIter, ((PyObject *(*)(t_RankIterator *)) get_next<t_RankIterator,t_JArray<jint>,JArray<jint>>), t_RankIterator__fields_, 0, 0);

Ah yes, that's invalid C++. Nested generics need a space inserted between the closing brackets, '> >'. Otherwise, the C++ parser reads '>>' as the bit-shifting operator, believe it or not. Should be easy enough to fix in jcc.

Fixed in trunk/branch_3x rev 1077828.

Andi..
[jira] Commented: (SOLR-1489) A UTF-8 character is output twice (Bug in Jetty)
[ https://issues.apache.org/jira/browse/SOLR-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001909#comment-13001909 ]

Jun Ohtani commented on SOLR-1489:
----------------------------------

Sekiguchi-san, I checked bufsize values 510-512 and only saw "B" once. Maybe it is OK.

A UTF-8 character is output twice (Bug in Jetty)
------------------------------------------------

Key: SOLR-1489
URL: https://issues.apache.org/jira/browse/SOLR-1489
Project: Solr
Issue Type: Bug
Environment: Jetty-6.1.3, Jetty-6.1.21, Jetty-7.0.0RC6
Reporter: Jun Ohtani
Assignee: Koji Sekiguchi
Priority: Critical
Attachments: SOLR-1489.patch, error_utf8-example.xml, jetty-6.1.22.jar, jetty-util-6.1.22.jar, jettybugsample.war, jsp-2.1.zip, servlet-api-2.5-20081211.jar

A UTF-8 character is output twice under particular conditions. The sample data is attached (error_utf8-example.xml). After registering only the sample data, click the following URL:

http://localhost:8983/solr/select?q=*%3A*&version=2.2&start=0&rows=10&omitHeader=true&fl=attr_json&wt=json

The sample data contains only "B", but the response is "BB". When wt=phps, an error occurs in PHP's unserialize() function. This looks like a bug in Jetty. jettybugsample.war is the simplest way to reproduce the problem. Copy it to example/webapps, start the Jetty server, and click the following URL:

http://localhost:8983/jettybugsample/filter/hoge

Like earlier, "B" is output twice. Sysout shows "B" only once. I have tested this on Jetty 6.1.3, 6.1.21, and 7.0.0rc6. (When testing with 6.1.21 or 7.0.0rc6, change bufsize from 128 to 512 in web.xml.)
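For readers trying to picture the failure mode: a minimal servlet sketch, assuming the trigger is a multi-byte UTF-8 character straddling the response buffer boundary. The actual reproduction is the attached jettybugsample.war; the class name, character, and byte counts below are illustrative only.

    import java.io.IOException;
    import java.io.PrintWriter;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Hypothetical reproduction: fill most of a 512-byte response buffer
    // with ASCII, then write a 3-byte UTF-8 character so its bytes are
    // split across the flush boundary.
    public class Utf8BoundaryServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            resp.setContentType("text/plain");
            resp.setCharacterEncoding("UTF-8");
            resp.setBufferSize(512); // must be set before writing
            PrintWriter out = resp.getWriter();
            for (int i = 0; i < 510; i++) {
                out.write('a');
            }
            out.write('\u3042'); // Hiragana 'A': 3 bytes in UTF-8
        }
    }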
[HUDSON] Solr-trunk - Build # 1428 - Failure
Build: https://hudson.apache.org/hudson/job/Solr-trunk/1428/

All tests passed.

Build Log (for compile errors):
[...truncated 14806 lines...]
Re: Boost function problem with disquerymax
You are right, it was not an indexed field, just stored. Thanks.

2011/3/2 Yonik Seeley <yo...@lucidimagination.com>:

On Wed, Mar 2, 2011 at 11:34 AM, Gastone Penzo <gastone.pe...@gmail.com> wrote:

Hi, for search I use dismax and I want to boost a field with the bf parameter, like: ...bf=boost_has_img^5. The boost_has_img field of my document is 3:

<int name="boost_has_img">3</int>

If I look at the results in debug query mode I can see:

0.0 = (MATCH) FunctionQuery(int(boost_has_img)), product of:
  0.0 = int(boost_has_img)=0
  5.0 = boost
  0.06543833 = queryNorm

Why is the score 0 if the value is 3 and the boost is 5???

Solr thinks the value of boost_has_img is 0 for that document. Is boost_has_img an indexed field? If so, verify that the value is actually 3 for that specific document.

-Yonik
http://lucidimagination.com

--
Gastone Penzo
Webster Srl
www.webster.it
www.libreriauniversitaria.it
perfect match in dismax search
How do I obtain a perfect match with a dismax query? E.g., I want to search for "hello i love you" with defType=dismax in the title field, and I want to obtain only results whose title is exactly "hello i love you", with all these terms in this order, not fewer words or other ones. How is it possible? I tried with +(hello i love you), but a title like "hello i love you mum" also matches, and I don't want that! Thanks

--
Gastone Penzo
Webster Srl
www.webster.it
www.libreriauniversitaria.it
[jira] Resolved: (SOLR-1489) A UTF-8 character is output twice (Bug in Jetty)
[ https://issues.apache.org/jira/browse/SOLR-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi resolved SOLR-1489.
----------------------------------

Resolution: Fixed
Fix Version/s: 4.0, 3.1

Marking resolved as duplicate of SOLR-2381.
[jira] Commented: (LUCENE-2822) TimeLimitingCollector starts thread in static {} with no way to stop them
[ https://issues.apache.org/jira/browse/LUCENE-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001967#comment-13001967 ]

Michael McCandless commented on LUCENE-2822:
--------------------------------------------

I think we should stick with our private timer thread (and we should definitely make it stop-able). I've seen too many problems associated with relying on the system's time for important things like timing out queries, e.g. when daylight savings time strikes, or the clock is being aggressively corrected, and suddenly a bunch of queries are truncated. In theory System.nanoTime should be immune to this (it's the system's timer and not any notion of wall clock time), but in practice, I don't think we should risk it.

TimeLimitingCollector starts thread in static {} with no way to stop them
--------------------------------------------------------------------------

Key: LUCENE-2822
URL: https://issues.apache.org/jira/browse/LUCENE-2822
Project: Lucene - Java
Issue Type: Bug
Reporter: Robert Muir

See the comment in LuceneTestCase. If you even do Class.forName("TimeLimitingCollector") it starts up a thread in a static method, and there isn't a way to kill it. This is broken.
[jira] Commented: (SOLR-2385) Backport latest /browse improvements to branch_3x
[ https://issues.apache.org/jira/browse/SOLR-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001969#comment-13001969 ]

Jan Høydahl commented on SOLR-2385:
-----------------------------------

I classified SOLR-2383 as a Bug, not a feature, because most people downloading Solr 3.1 will start customizing facets and get puzzled when the range facet still reads "Price ($)" and their own facets do not show up. I'm sure this will generate a bunch of traffic on the mailing lists.

Backport latest /browse improvements to branch_3x
-------------------------------------------------

Key: SOLR-2385
URL: https://issues.apache.org/jira/browse/SOLR-2385
Project: Solr
Issue Type: Improvement
Components: Response Writers
Affects Versions: 3.1
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
Labels: velocity
Fix For: 3.1
Attachments: SOLR-2385.patch, SOLR-2385.patch

There are a lot of improvements in the TRUNK Velocity GUI which will work well even for 3.1.
[jira] Commented: (SOLR-2383) Velocity: Generalize range and date facet display
[ https://issues.apache.org/jira/browse/SOLR-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001970#comment-13001970 ]

Jan Høydahl commented on SOLR-2383:
-----------------------------------

I would appreciate it if someone could test this patch on your own data, trying various combinations of facet.range and gaps, to see if it is watertight.

Velocity: Generalize range and date facet display
-------------------------------------------------

Key: SOLR-2383
URL: https://issues.apache.org/jira/browse/SOLR-2383
Project: Solr
Issue Type: Bug
Components: Response Writers
Reporter: Jan Høydahl
Labels: facet, range, velocity
Attachments: SOLR-2383.patch, SOLR-2383.patch, SOLR-2383.patch

The Velocity (/browse) GUI has a hardcoded price range facet and a hardcoded manufacturedate_dt date facet. We need a general solution which works for any facet.range and facet.date.
Re: wind down for 3.1?
Hello all, is there any update on the 3.1 status? I'm really looking forward to it :) Regards, Sanne

2011/2/16 Chris Hostetter <hossman_luc...@fucit.org>:

: 1. javadocs warnings/errors: this is a constant battle, its worth
: considering if the build should actually fail if you get one of these,
: in my opinion if we can do this we really should. its frustrating to

for a brief period we did, and then we rolled it back...
https://issues.apache.org/jira/browse/LUCENE-875

: 2. introducing new compiler warnings: another problem just being left
: for someone else to clean up later, another constant losing battle.
: 99% of the time (for non-autogenerated code) the warnings are
: useful... in my opinion we should not commit patches that create new
: warnings.

it's hard to spot new compiler warnings when there are already so many ... if we can get down to 0 then we can add hacks to make the build fail if someone adds 1, but until then we have an uphill battle.

-Hoss
Re: wind down for 3.1?
On Thu, Mar 3, 2011 at 7:43 AM, Sanne Grinovero <sanne.grinov...@gmail.com> wrote:

Hello all, is there any update on the 3.1 status? I'm really looking forward to it :)

Yes, we are currently in the feature freeze, and it seems to be coming into shape. I'm planning on creating the release branch this weekend and getting our first RC out Sunday (Steven Rowe volunteered to help with the Maven side, thanks!).

If you want to help, for example you can check out the Lucene code from http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/, then run 'ant clean dist dist-src', inspect the artifacts it puts in the dist/ folder, and report any problems. If everyone waits until we build an RC before reviewing how things look and reporting problems, it's going to significantly slow down the release process, as generating RCs for both Lucene and Solr at the moment is nontrivial (which is why Steven and I have set aside this day to try to build RC1; if the vote doesn't pass, it might be weeks before we have the time to build RC2).
Re: wind down for 3.1?
2011/3/3 Robert Muir <rcm...@gmail.com>:

[...]

Cheers, thanks a lot. I'm definitely testing it often, and will report anything weird. I can't say about Solr though, as we use Lucene mostly. Sanne
[jira] Commented: (LUCENE-2822) TimeLimitingCollector starts thread in static {} with no way to stop them
[ https://issues.apache.org/jira/browse/LUCENE-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002002#comment-13002002 ]

Robert Muir commented on LUCENE-2822:
-------------------------------------

bq. I think we should stick with our private timer thread (and we should definitely make it stop-able).

And no private thread should start in the static initializer... it's fine for all instances to share a single private timer thread, but this should be lazy-loaded.
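A sketch of what such a lazy-loaded, stoppable timer thread could look like, using the holder-class idiom. This illustrates the proposal only; it is not the patch that was eventually committed, and the names and the 25 ms resolution are assumptions.

    // Illustrative only: a shared timer thread that is created on first
    // use rather than when the enclosing class is loaded, and that can be
    // stopped.
    public final class TimerThread extends Thread {
        private volatile long time;         // coarse clock read by collectors
        private volatile boolean stop;
        private static final long RESOLUTION = 25; // ms between ticks (assumed)

        private static final class Holder {
            // Runs only when getInstance() is first called, so merely doing
            // Class.forName() on the enclosing class starts no thread.
            static final TimerThread INSTANCE = new TimerThread();
            static { INSTANCE.start(); }
        }

        public static TimerThread getInstance() { return Holder.INSTANCE; }

        private TimerThread() {
            super("TimeLimitingCollector timer thread");
            setDaemon(true);
        }

        @Override
        public void run() {
            while (!stop) {
                time += RESOLUTION; // single writer, so the += is safe
                try {
                    Thread.sleep(RESOLUTION);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }

        public long getMilliseconds() { return time; }

        public void stopTimer() { stop = true; } // makes the thread stoppable
    }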
Re: [Lucene.Net] CI Task Update: Hudkins
Sorry, I've just been reading through the list wiki (http://wiki.apache.org/general/Hudson). I subscribed to the list last night and haven't received the usual message for activating subscriptions; I'll try again today and get on the list to see what is already on the Windows slave. We also have to ask if others are interested in tools that are not installed, and if not, install them under home/username. - michael.

On Wed, Mar 2, 2011 at 5:19 PM, Troy Howard <thowar...@gmail.com> wrote:

I've been following builds@ for the past couple of days. Looks like they just finished the migration to Jenkins. Michael - Have you had a chance to contact them and find out what tools are available out of our list? Want me to do that? Thanks, Troy

On Mon, Feb 28, 2011 at 9:19 PM, Scott Lombard <slomb...@theta.net> wrote: +1 Scott

On Mon, Feb 28, 2011 at 5:18 AM, Stefan Bodewig <bode...@apache.org> wrote:

On 2011-02-28, Troy Howard wrote:

One quick concern I have is how much of the things listed are already available on the Apache Hudson server?

builds@apache is the place to ask. A lot of this is .NET specific, so it's unlikely that it will already be available. Well, the DotCMIS build seems to be using Sandcastle Helpfile Builder, judging by the console output.

We'll have to request that the ASF Infra team install these tools for us, and they may not agree, or there might be licensing issues, etc. Not sure. I'd start the conversation with them now to suss this out.

Really, go to the builds list. License issues usually don't show up for build tools. It would be good if anybody on the team could volunteer time helping administrate the Windows slave.

- Mono is going to be a requirement moving forward

This could be done on a non-Windows slave just to be completely sure it works. This may require installing a newer Mono (or just pulling in a different Debian package source for Mono) than is installed by default.

- Project structure was being discussed on the LUCENENET-377 thread.

As a quick note: in general we prefer the mailing list or JIRA for discussions around the ASF.

Stefan
[jira] Created: (LUCENE-2947) NGramTokenizer shouldn't trim whitespace
NGramTokenizer shouldn't trim whitespace
----------------------------------------

Key: LUCENE-2947
URL: https://issues.apache.org/jira/browse/LUCENE-2947
Project: Lucene - Java
Issue Type: Bug
Components: contrib/analyzers
Affects Versions: 3.0.3
Reporter: David Byrne
Priority: Minor

Before I tokenize my strings, I am padding them with whitespace:

String foobar = " " + foo + " " + bar + " ";

When constructing term vectors from ngrams, this strategy has a couple of benefits. First, it places special emphasis on the starting and ending of a word. Second, it improves the similarity between phrases with swapped words: " foo bar " matches " bar foo " more closely than "foo bar" matches "bar foo".

The problem is that Lucene's NGramTokenizer trims whitespace. This forces me to do some preprocessing on my strings before I can tokenize them:

foobar.replaceAll(" ","$"); //arbitrary char not in my data

This is undocumented, so users won't realize their strings are being trim()'ed unless they look through the source or examine the tokens manually. I am proposing NGramTokenizer should be changed to respect whitespace. Is there a compelling reason against this?
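A quick way to observe the reported behavior, sketched against the 3.0.x analysis API the reporter is on (the constructor arguments and TermAttribute usage are assumptions based on that era's API):

    import java.io.StringReader;
    import org.apache.lucene.analysis.ngram.NGramTokenizer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public class PaddingDemo {
        public static void main(String[] args) throws Exception {
            // min and max gram size of 3 -> trigrams
            NGramTokenizer tokenizer =
                new NGramTokenizer(new StringReader(" foo bar "), 3, 3);
            TermAttribute term = tokenizer.addAttribute(TermAttribute.class);
            while (tokenizer.incrementToken()) {
                // Prints trigrams of "foo bar", not " foo bar ": the
                // leading/trailing whitespace has been trim()'ed away.
                System.out.println("[" + term.term() + "]");
            }
        }
    }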
[jira] Commented: (LUCENE-2947) NGramTokenizer shouldn't trim whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002019#comment-13002019 ]

Robert Muir commented on LUCENE-2947:
-------------------------------------

Hi Dave, in my opinion there are a lot of problems with our current NGramTokenizer (yours is just one) and it would be a good idea to consider creating a new one. We could rename the old one to ClassicNGramTokenizer or something for people that need the backwards compatibility. A lot of the problems already have open JIRA issues; I gave my opinion on some of the broken-ness in LUCENE-1224. The largest problem is that these tokenizers only examine the first 1024 chars of the document. They shouldn't just discard anything after 1024 chars. There is no need to load the 'entire document' into memory... n-gram tokenization can work on a sliding window across the document (see the sketch after this comment).

In my opinion, part of n-gram character tokenization is being able to configure what is a token character and what is not. (Note I don't mean a Java character here, but the more abstract sense; e.g. a character might have diacritics and be treated as a single unit.) For some applications maybe this is just 'alphabetic letters'; for other apps perhaps even punctuation could be considered 'relevant'. So it should somehow be flexible. Furthermore, in the case of word-spanning n-grams, you should be able to collapse runs of non-characters into a single marker (e.g. _), and usually you would want to do this for the start and end of string too.

Here's a visual representation of how things should look when you use these tokenizers, in my opinion: http://www.csee.umbc.edu/~nicholas/601/SIGIR08-Poster.pdf
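To make the sliding-window point concrete, here is a rough sketch of character n-grams produced from a Reader with only n chars buffered, so document length is no constraint. It deliberately ignores the token-character and supplementary-character concerns discussed above, and the names are illustrative, not a proposed API.

    import java.io.IOException;
    import java.io.Reader;

    public final class SlidingNGrams {
        // Emit every n-char window of the input, one read() at a time;
        // memory use is O(n) regardless of document size.
        public static void emit(Reader reader, int n) throws IOException {
            char[] window = new char[n];
            int filled = 0;
            int c;
            while ((c = reader.read()) != -1) {
                if (filled < n) {
                    window[filled++] = (char) c;
                    if (filled < n) continue; // window not yet full
                } else {
                    // slide: drop the oldest char, append the new one
                    System.arraycopy(window, 1, window, 0, n - 1);
                    window[n - 1] = (char) c;
                }
                System.out.println(new String(window, 0, n));
            }
        }
    }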
[jira] Updated: (LUCENE-2947) NGramTokenizer shouldn't trim whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Byrne updated LUCENE-2947:
--------------------------------

Attachment: NGramTokenizerTest.java

A simple failing JUnit test illustrating the problem.
[jira] Commented: (LUCENE-2947) NGramTokenizer shouldn't trim whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002026#comment-13002026 ]

David Byrne commented on LUCENE-2947:
-------------------------------------

Thanks for the feedback, Robert. I'll give it a shot and try to write a new one. I wanted to write a tokenizer to support skip-grams anyway.
[jira] Commented: (LUCENE-2947) NGramTokenizer shouldn't trim whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002031#comment-13002031 ]

Robert Muir commented on LUCENE-2947:
-------------------------------------

Thank you... by the way, if you want to do skip-grams as a separate tokenizer or whatever, you know, whatever makes sense... I could imagine some of the n-gram variations might need to be their own tokenizers to prevent things from getting too complicated, but perhaps they could still share some code. (But maybe you have some way to fit skip-grams in there easily.)
[jira] Commented: (LUCENE-2947) NGramTokenizer shouldn't trim whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002035#comment-13002035 ]

David Byrne commented on LUCENE-2947:
-------------------------------------

Yeah, I was originally planning to implement skip-grams as a separate tokenizer. Since we are re-evaluating ngram tokenization in general, maybe I can come up with an elegant solution (a sketch of what skip-grams produce follows). Support for positional ngrams is another thing to consider.
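For readers unfamiliar with the term, a small sketch of what k-skip-bigrams over word tokens produce; an eventual tokenizer would emit these incrementally as a token stream rather than building a list, and the names here are illustrative.

    import java.util.ArrayList;
    import java.util.List;

    public final class SkipGrams {
        // Pair each token with each of the next k+1 tokens, i.e. bigrams
        // that may "skip" up to k intervening words.
        public static List<String> skipBigrams(String[] tokens, int k) {
            List<String> grams = new ArrayList<String>();
            for (int i = 0; i < tokens.length; i++) {
                for (int skip = 0; skip <= k && i + 1 + skip < tokens.length; skip++) {
                    grams.add(tokens[i] + " " + tokens[i + 1 + skip]);
                }
            }
            return grams;
        }
    }

    // e.g. skipBigrams({"the","rain","in","spain"}, 1) yields:
    //   "the rain", "the in", "rain in", "rain spain", "in spain"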
[jira] Updated: (LUCENE-2948) Make var gap terms index a partial prefix trie
[ https://issues.apache.org/jira/browse/LUCENE-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2948:
---------------------------------------

Attachment: LUCENE-2948.patch

Initial patch. This is a checkpoint of work in progress -- all tests pass, but there are zillions of nocommits to be resolved...
[jira] Created: (LUCENE-2948) Make var gap terms index a partial prefix trie
Make var gap terms index a partial prefix trie
----------------------------------------------

Key: LUCENE-2948
URL: https://issues.apache.org/jira/browse/LUCENE-2948
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 4.0
Attachments: LUCENE-2948.patch

Var gap stores (in an FST) the indexed terms (every 32nd term, by default), minus their non-distinguishing suffixes. However, oftentimes the resulting FST is close to a prefix trie in some portion of the terms space. By allowing some nodes of the FST to store all outgoing edges, including ones that do not lead to an indexed term, and by recording that this node is then authoritative as to what terms exist in the terms dict from that prefix, we can get some important benefits:

* It becomes possible to know that a certain term prefix cannot exist in the terms index, which means we can save a disk seek in some cases (like PK lookup, docFreq, etc.)
* We can query for the next possible prefix in the index, allowing some MTQs (eg FuzzyQuery) to save disk seeks.

Basically, the terms index is able to answer questions that previously required seeking/scanning in the terms dict file.
[jira] Commented: (SOLR-2381) The included jetty server does not support UTF-8
[ https://issues.apache.org/jira/browse/SOLR-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002069#comment-13002069 ]

Uwe Schindler commented on SOLR-2381:
-------------------------------------

OK, thanks for reporting back. So there was maybe a problem in the past with XMLWriter, which is solved in Lucene trunk. Can you also check branch_3x (Lucene 3.1)? Because this is the next release, and trunk (Lucene 4.0) is very unstable.

The included jetty server does not support UTF-8
------------------------------------------------

Key: SOLR-2381
URL: https://issues.apache.org/jira/browse/SOLR-2381
Project: Solr
Issue Type: Bug
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Blocker
Fix For: 3.1, 4.0
Attachments: SOLR-2381.patch, SOLR-ServletOutputWriter.patch, jetty-6.1.26-patched-JETTY-1340.jar, jetty-util-6.1.26-patched-JETTY-1340.jar

Some background here: http://www.lucidimagination.com/search/document/6babe83bd4a98b64/which_unicode_version_is_supported_with_lucene

Some possible solutions:
* wait and see if we get resolution on http://jira.codehaus.org/browse/JETTY-1340. To be honest, I am not even sure where Jetty is being maintained (there is a separate Jetty project at eclipse.org with another bug tracker, but the older releases are at codehaus).
* include a patched version of Jetty with correct UTF-8, using that patch.
* remove Jetty and include a different container instead.
[jira] Commented: (LUCENE-2822) TimeLimitingCollector starts thread in static {} with no way to stop them
[ https://issues.apache.org/jira/browse/LUCENE-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002071#comment-13002071 ]

Uwe Schindler commented on LUCENE-2822:
---------------------------------------

bq. I think we should stick with our private timer thread (and we should definitely make it stop-able).

I think this is still the best approach, as both System.nanoTime() and currentTimeMillis() use system calls that are really expensive. nanoTime() has no wall-clock problems, that's true, but it is still a no-go for every collected hit!
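To illustrate why the timer thread is cheaper per hit: with a shared clock like the TimerThread sketched earlier, the per-document check in a collector reduces to one volatile read and a compare instead of a system call. This is a hedged sketch against the 3.x Collector API, not the actual TimeLimitingCollector code; the class and field names are invented.

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    public class TimeBoundedCollector extends Collector {
        private final Collector delegate;
        private final TimerThread timer;
        private final long deadline;

        public TimeBoundedCollector(Collector delegate, TimerThread timer, long budgetMs) {
            this.delegate = delegate;
            this.timer = timer;
            this.deadline = timer.getMilliseconds() + budgetMs;
        }

        @Override
        public void collect(int doc) throws IOException {
            // One volatile read per hit, no System.nanoTime() syscall.
            if (timer.getMilliseconds() > deadline) {
                throw new RuntimeException("query timed out at doc " + doc);
            }
            delegate.collect(doc);
        }

        @Override
        public void setScorer(Scorer scorer) throws IOException {
            delegate.setScorer(scorer);
        }

        @Override
        public void setNextReader(IndexReader reader, int docBase) throws IOException {
            delegate.setNextReader(reader, docBase);
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return delegate.acceptsDocsOutOfOrder();
        }
    }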
[jira] Updated: (LUCENE-2948) Make var gap terms index a partial prefix trie
[ https://issues.apache.org/jira/browse/LUCENE-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2948:
---------------------------------------

Attachment: LUCENE-2948.patch

New patch -- changes nextPossiblePrefix to return a SeekStatus.
Maven Lucene Plugin : Contribution
Hi, I have released an open source project, maven-lucene-plugin (https://sourceforge.net/projects/lucene-plugin/), hosted at SourceForge. It's a Maven plugin with which an index can be created (without writing any code) from a file source. The structure of the index can be defined in a file, lucene.xml. I have also created a dependency which provides easy-to-use methods that work on the same index created by the maven-lucene-plugin (using the same lucene.xml). This plugin removes the need for knowledge of Lucene's API in order to use Lucene.

The full documentation about the plugin can be found here: http://xebee.xebia.in/2011/02/28/maven-lucene-plugin/. The plugin is available in the Central Maven Repository (http://repo1.maven.org/maven2/com/) and can also be browsed here: https://oss.sonatype.org/index.html#nexus-search;quick%7Emaven-lucene-plugin.

I would highly appreciate your feedback on the plugin, along with some suggestions on how to improve it. As this is only the first version, we have a lot to develop in the plugin. If you are interested in contributing to the plugin, please write to me; that would be very helpful. If Apache wants, I would be glad to donate the plugin to Apache to make it stronger. Right now, the source code is on SourceForge (https://lucene-plugin.svn.sourceforge.net/svnroot/lucene-plugin/trunk/maven-lucene-plugin/) and the artifacts are on the Central Maven Repository (http://repo1.maven.org/maven2/com/xebia/).

Thanks and Regards,
Paritosh Ranjan
Re: Unintuitive NGramTokenizer behavior
On Mar 3, 2011, at 9:36 AM, David Byrne wrote:

I have a minor quibble about Lucene's NGramTokenizer. Before I tokenize my strings, I am padding them with whitespace:

String foobar = " " + foo + " " + bar + " ";

When constructing term vectors from ngrams, this strategy has a couple of benefits. First, it places special emphasis on the starting and ending of a word. Second, it improves the similarity between phrases with swapped words: " foo bar " matches " bar foo " more closely than "foo bar" matches "bar foo".

I'm not following this argument. What does the extra whitespace give you here?

The problem is that Lucene's NGramTokenizer trims whitespace. This forces me to do some preprocessing on my strings before I can tokenize them:

foobar.replaceAll(" ","$"); //arbitrary char not in my data

I'm confused. If you are padding them up front, then why don't you just do the arbitrary-char trick then? Where is the extra processing?

This is undocumented, so users won't realize their strings are being trim()'ed unless they look through the source or examine the tokens manually.

It may be undocumented, but I think it is pretty standard as to what users expect out of a tokenizer.

I am proposing NGramTokenizer should be changed to respect whitespace. Is there a compelling reason against this?

Unfortunately, I'm not following your reasons for doing it. I won't say I'm against it at this point, but I don't see a compelling reason to change it either, so if you could clarify that would be great. It's been around for quite some time in its current form and I think it fits most people's expectations of ngrams.

-Grant
[jira] Commented: (LUCENE-2822) TimeLimitingCollector starts thread in static {} with no way to stop them
[ https://issues.apache.org/jira/browse/LUCENE-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002115#comment-13002115 ]

Mark Harwood commented on LUCENE-2822:
--------------------------------------

FYI - I visited a site today using LUCENE-1720 live on a large index (2 billion docs, sharded, with 5-minute update intervals). They haven't noticed any significant degrading of search performance as a result of using this approach.
Re: Unintuitive NGramTokenizer behavior
On Thu, Mar 3, 2011 at 1:00 PM, Grant Ingersoll <gsing...@apache.org> wrote:

Unfortunately, I'm not following your reasons for doing it. I won't say I'm against it at this point, but I don't see a compelling reason to change it either, so if you could clarify that would be great. It's been around for quite some time in its current form and I think it fits most people's expectations of ngrams.

Grant, I'm sorry, but I couldn't disagree more. There are many variations on ngram tokenization (word-internal, word-spanning, skipgrams), besides allowing flexibility for what should be a word character and what should not be (e.g. punctuation), and how to handle the specifics of these. But our n-gram tokenizer is *UNARGUABLY* completely broken for these reasons:

1. it discards anything after the first 1024 code units of the document.
2. it uses partial characters (UTF-16 code units) as its fundamental measure, potentially creating lots of invalid Unicode.
3. it forms n-grams in the wrong order, contributing to #1. I explained this in LUCENE-1224.

It's for these reasons that I suggested we completely rewrite it... people that are just indexing English documents with 1024 chars per document and don't care about these things can use ClassicNGramTokenizer.
RE: [Lucene.Net] how to add a new record to existing index
I don't think that I understand your problem. Is it something like:

IndexWriter writer = new IndexWriter(path, analyzer, *false*, IndexWriter.DEFAULT_MAX_FIELD_LENGTH);
...
writer.AddDocument(doc);

DIGY

-----Original Message-----
From: Wen Gao [mailto:samuel.gao...@gmail.com]
Sent: Thursday, March 03, 2011 1:43 AM
To: lucene-net-...@lucene.apache.org
Subject: Re: [Lucene.Net] how to add a new record to existing index

Hi Digy, it was my fault that I didn't say it clearly. I mean I have created an index, but it is not updated in real time. So I want to update the index every time after I add data to the database, to keep the index up-to-date. My data is what the user inputs and inserts into the database. BTW, I know how to delete a term from the index using IndexReader. Likewise, I want to write a term to the created index instead of creating a new index. I appreciate your time. Thanks, Wen

2011/3/2 Digy <digyd...@gmail.com>:

First of all, your code doesn't mean anything to me other than that you add some fields to a document object. Also, I can't see what you mean by *existing* index. The directory you pass to the IndexWriter is the index you use, and every document added (using IndexWriter's AddDocument) is written to that index. I think we have problems in using a common terminology.

DIGY

PS: It would be better if you use the user mailing list to ask questions. This mailing list is intended to be for development purposes.

-----Original Message-----
From: Wen Gao [mailto:samuel.gao...@gmail.com]
Sent: Wednesday, March 02, 2011 11:02 PM
To: lucene-net-...@lucene.apache.org
Subject: [Lucene.Net] how to add a new record to existing index

Hi, I have already created an index, and I want to insert an index record into this existing index every time I insert a new record into the database. For example, if I want to insert a record (l1, 15, tom, 20, 2010/01/02) into my *existing* index, how can I do this? (I don't want to create a new index, which takes too much time.) The format of my index is as follows:

doc.Add(new Lucene.Net.Documents.Field(
    "lmname",
    readerreader1["lmname"].ToString(),
    //new System.IO.StringReader(readerreader["cname"].ToString()),
    Lucene.Net.Documents.Field.Store.YES,
    Lucene.Net.Documents.Field.Index.TOKENIZED));
//lmid
doc.Add(new Lucene.Net.Documents.Field(
    "lmid",
    readerreader1["lmid"].ToString(),
    Lucene.Net.Documents.Field.Store.YES,
    Lucene.Net.Documents.Field.Index.UN_TOKENIZED));
// nick name of user
doc.Add(new Lucene.Net.Documents.Field(
    "nickName",
    readerreader1["nickName"].ToString(),
    Lucene.Net.Documents.Field.Store.YES,
    Lucene.Net.Documents.Field.Index.UN_TOKENIZED));
// uid
doc.Add(new Lucene.Net.Documents.Field(
    "uid",
    readerreader1["uid"].ToString(),
    Lucene.Net.Documents.Field.Store.YES,
    Lucene.Net.Documents.Field.Index.UN_TOKENIZED));
// acttime
doc.Add(new Lucene.Net.Documents.Field(
    "acttime",
    readerreader1["acttime"].ToString(),
    Lucene.Net.Documents.Field.Store.YES,
    Lucene.Net.Documents.Field.Index.UN_TOKENIZED));
writer.AddDocument(doc);

Thanks, Wen
Re: Unintuitive NGramTokenizer behavior
Grant, to explain the advantage:

Trigrams for "foo bar": 'foo', 'oo ', 'o b', ' ba', 'bar'
Trigrams for "bar foo": 'bar', 'ar ', 'r f', ' fo', 'foo'

Only two out of eight unique trigrams match.

Trigrams for " foo bar ": ' fo', 'foo', 'oo ', 'o b', ' ba', 'bar', 'ar '
Trigrams for " bar foo ": ' ba', 'bar', 'ar ', 'r f', ' fo', 'foo', 'oo '

Six out of eight unique trigrams match.

I can't do the character replacement up front, because foo and bar might already contain whitespace as well. Anyway, it's a hack, and if my arbitrary character ever gets introduced into the data I am in trouble. Not only is this undocumented, but it seems unintentional if you look at the comments in the code. FYI, I opened up an issue regarding this: http://bit.ly/eqhTO1
[jira] Commented: (LUCENE-2822) TimeLimitingCollector starts thread in static {} with no way to stop them
[ https://issues.apache.org/jira/browse/LUCENE-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002129#comment-13002129 ] Uwe Schindler commented on LUCENE-2822: --- Mark: But LUCENE-1720 does not use a System.nanoTime()/System.currentTimeMillis(), so what is your comment about? TimeLimitingCollector starts thread in static {} with no way to stop them - Key: LUCENE-2822 URL: https://issues.apache.org/jira/browse/LUCENE-2822 Project: Lucene - Java Issue Type: Bug Reporter: Robert Muir See the comment in LuceneTestCase. If you even do Class.forName(TimeLimitingCollector) it starts up a thread in a static method, and there isn't a way to kill it. This is broken. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
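To make the report concrete, here is an illustrative sketch of the anti-pattern the issue describes, together with the obvious stoppable alternative; this is not the actual TimeLimitingCollector source, and all names are invented:

{noformat}
// BAD: merely calling Class.forName() on this class starts a thread
// that nothing can ever stop.
class StaticTimer {
    static volatile long time;
    static {
        Thread t = new Thread() {
            @Override public void run() {
                while (true) { // no exit flag, so the thread lives forever
                    time = System.currentTimeMillis();
                    try { Thread.sleep(5); } catch (InterruptedException e) { return; }
                }
            }
        };
        t.setDaemon(true);
        t.start();
    }
}

// BETTER: start lazily and expose a way to shut the thread down.
class StoppableTimer extends Thread {
    volatile long time;
    private volatile boolean stop = false;
    @Override public void run() {
        while (!stop) {
            time = System.currentTimeMillis();
            try { Thread.sleep(5); } catch (InterruptedException e) { break; }
        }
    }
    void stopTimer() { stop = true; interrupt(); }
}
{noformat}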
[jira] Created: (SOLR-2399) Solr Admin Interface, reworked
Solr Admin Interface, reworked -- Key: SOLR-2399 URL: https://issues.apache.org/jira/browse/SOLR-2399 Project: Solr Issue Type: Improvement Components: web gui Reporter: Stefan Matheis (steffkes) Priority: Minor *The idea was to create a new, fresh (and hopefully clean) Solr Admin Interface.* [Based on this [ML-Thread|http://www.lucidimagination.com/search/document/ae35e236d29d225e/solr_admin_interface_reworked_go_on_go_away]] I've quickly created a Github-Repository (Just for me, to keep track of the changes) » https://github.com/steffkes/solr-admin [This commit shows the differences|https://github.com/steffkes/solr-admin/commit/5f80bb0ea9deb4b94162632912fe63386f869e0d] between the old/existing index.jsp and my new one (which I copy-cut/paste'd from the existing one). Main Action takes place in [js/script.js|https://github.com/steffkes/solr-admin/blob/master/js/script.js] which is actually neither clean nor pretty .. just work-in-progress. Actually it's Work in Progress, so ... give it a try. It's developed with Firefox as Browser, so, for a first impression .. please don't use _things_ like Internet Explorer or so ;o Jan already suggested a bunch of good things, i'm sure there are more ideas over there :) -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [Lucene.Net] CI Task Update: Hudkins
I'd ask if we need to install this stuff on hudson at all -- most of it is command line utilities that can be transported with the source in svn anyhow. Sidebar advantages here are that it is much easier to debug the build scripts since you have no environmental dependencies, and you've got all the toys in one download. On Thu, Mar 3, 2011 at 10:05 AM, Michael Herndon mhern...@o19s.com wrote: Sorry I've been just reading through the list wiki ( http://wiki.apache.org/general/Hudson) I subscribed to the list last night and haven't received the usual message for activating subscriptions, I'll try again today and get on the list to see what is already on the windows slave. We also have to ask if others are interested in tools that are not installed and, if not, install them under home/username. - michael. On Wed, Mar 2, 2011 at 5:19 PM, Troy Howard thowar...@gmail.com wrote: I've been following builds@ for the past couple of days. Looks like they just finished the migration to Jenkins. Michael - Have you had a chance to contact them and find out what tools are available out of our list? Want me to do that? Thanks, Troy On Mon, Feb 28, 2011 at 9:19 PM, Scott Lombard slomb...@theta.net wrote: +1 Scott On Mon, Feb 28, 2011 at 5:18 AM, Stefan Bodewig bode...@apache.org wrote: On 2011-02-28, Troy Howard wrote: One quick concern I have is how much of the things listed are already available on the Apache hudson server? builds@apache is the place to ask. A lot of this is .NET specific, so it is unlikely that it will already be available. Well, the DotCMIS build seems to be using Sandcastle Helpfile Builder, by looking at the console output. We'll have to request that the ASF Infra team install these tools for us, and they may not agree, or there might be licensing issues, etc.. Not sure. I'd start the conversation with them now to suss this out. Really, go to the builds list. License issues usually don't show up for build tools. It may be good if anybody on the team could volunteer time to help administrate the Windows slave. - Mono is going to be a requirement moving forward This could be done on a non-Windows slave just to be completely sure it works. This may require installing a newer Mono (or just pulling in a different Debian package source for Mono) than is installed by default. - Project structure was being discussed on the LUCENENET-377 thread. As a quick note, in general we prefer the mailing list or JIRA for discussions around the ASF. Stefan
[jira] Updated: (LUCENE-2919) IndexSplitter that divides by primary key term
[ https://issues.apache.org/jira/browse/LUCENE-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-2919: - Attachment: LUCENE-2919.patch First cut. Roughly divides an index by the exclusive mid term given. IndexSplitter that divides by primary key term -- Key: LUCENE-2919 URL: https://issues.apache.org/jira/browse/LUCENE-2919 Project: Lucene - Java Issue Type: Improvement Affects Versions: 4.0 Reporter: Jason Rutherglen Priority: Minor Attachments: LUCENE-2919.patch Index splitter that divides by primary key term. The contrib MultiPassIndexSplitter we have divides by docid, however to guarantee external constraints it's sometimes necessary to split by a primary key term id. I think this implementation is a fairly trivial change. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Unintuitive NGramTokenizer behavior
On Thu, Mar 3, 2011 at 2:06 PM, Grant Ingersoll gsing...@apache.org wrote: On Mar 3, 2011, at 1:10 PM, Robert Muir wrote: On Thu, Mar 3, 2011 at 1:00 PM, Grant Ingersoll gsing...@apache.org wrote: Unfortunately, I'm not following your reasons for doing it. I won't say I'm against it at this point, but I don't see a compelling reason to change it either, so if you could clarify that would be great. It's been around for quite some time in its current form and I think it fits most people's expectations of ngrams. Grant I'm sorry, but I couldn't disagree more. There are many variations on ngram tokenization (word-internal, word-spanning, skipgrams), besides allowing flexibility for what should be a word character and what should not be (e.g. punctuation), and how to handle the specifics of these. But our n-gram tokenizer is *UNARGUABLY* completely broken for these reasons: 1. it discards anything after the first 1024 code units of the document. 2. it uses partial characters (UTF-16 code units) as its fundamental measure, potentially creating lots of invalid unicode. 3. it forms n-grams in the wrong order, contributing to #1. I explained this in LUCENE-1224 Sure, but those are ancillary to the whitespace question that was asked about. Not really? it's the more general form of the whitespace question. I'm saying you should be able to say 'this is part of a word', but then also specify if you want to fold runs of non-characters into a single thing (e.g. '_') or into nothing at all, or whatever. Additionally NGramTokenizer should also support an option to treat start and end of string as non-characters... in my opinion this should be the default and is the root cause of Dave's issue? - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
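Point 2 is easy to demonstrate with nothing but the JDK; the string and class here are made up for illustration:

    public class SurrogateDemo {
        public static void main(String[] args) {
            String s = "a\uD834\uDD1Eb"; // 'a', U+1D11E (one code point, two code units), 'b'

            // code-unit bigram: cuts the supplementary character in half
            String bad = s.substring(0, 2);
            System.out.println(Character.isHighSurrogate(bad.charAt(1))); // true: invalid gram

            // code-point-aware stepping never splits a surrogate pair
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                int next = i + Character.charCount(cp);
                System.out.println(s.substring(i, next)); // "a", the clef, "b"
                i = next;
            }
        }
    }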
Re: [Lucene.Net] CI Task Update: Hudkins
The licenses on some of the tools may not cover having them in svn. If there were free open source versions that allowed redistributing the binaries of all the tools listed that did the job well, we could put them into svn. However, as far as I know there isn't a pure .net open source tool chain for some of the tools listed above. Thus it needs to be put on the slave or be accessible from the slave with the limited licenses. Also one would want to install hudson plugins that generate reports and graphical information based on the xml documents generated during the build, then have hudson process them on initial load and display them on the dashboard / build output. - Michael On Thu, Mar 3, 2011 at 1:57 PM, Wyatt Barnett wyatt.barn...@gmail.com wrote: I'd ask if we need to install this stuff on hudson at all -- most of it is command line utilities that can be transported with the source in svn anyhow. Sidebar advantages here are that it is much easier to debug the build scripts since you have no environmental dependencies, and you've got all the toys in one download. On Thu, Mar 3, 2011 at 10:05 AM, Michael Herndon mhern...@o19s.com wrote: Sorry I've been just reading through the list wiki ( http://wiki.apache.org/general/Hudson) I subscribed to the list last night and haven't received the usual message for activating subscriptions, I'll try again today and get on the list to see what is already on the windows slave. We also have to ask if others are interested in tools that are not installed and, if not, install them under home/username. - michael. On Wed, Mar 2, 2011 at 5:19 PM, Troy Howard thowar...@gmail.com wrote: I've been following builds@ for the past couple of days. Looks like they just finished the migration to Jenkins. Michael - Have you had a chance to contact them and find out what tools are available out of our list? Want me to do that? Thanks, Troy On Mon, Feb 28, 2011 at 9:19 PM, Scott Lombard slomb...@theta.net wrote: +1 Scott On Mon, Feb 28, 2011 at 5:18 AM, Stefan Bodewig bode...@apache.org wrote: On 2011-02-28, Troy Howard wrote: One quick concern I have is how much of the things listed are already available on the Apache hudson server? builds@apache is the place to ask. A lot of this is .NET specific, so it is unlikely that it will already be available. Well, the DotCMIS build seems to be using Sandcastle Helpfile Builder, by looking at the console output. We'll have to request that the ASF Infra team install these tools for us, and they may not agree, or there might be licensing issues, etc.. Not sure. I'd start the conversation with them now to suss this out. Really, go to the builds list. License issues usually don't show up for build tools. It may be good if anybody on the team could volunteer time to help administrate the Windows slave. - Mono is going to be a requirement moving forward This could be done on a non-Windows slave just to be completely sure it works. This may require installing a newer Mono (or just pulling in a different Debian package source for Mono) than is installed by default. - Project structure was being discussed on the LUCENENET-377 thread. As a quick note, in general we prefer the mailing list or JIRA for discussions around the ASF. Stefan -- Michael Herndon Senior Developer (mhern...@o19s.com) 804.767.0083 [connect online] http://www.opensourceconnections.com http://www.amptools.net http://www.linkedin.com/pub/michael-herndon/4/893/23 http://www.facebook.com/amptools.net http://www.twitter.com/amptools-net
[jira] Updated: (LUCENE-2948) Make var gap terms index a partial prefix trie
[ https://issues.apache.org/jira/browse/LUCENE-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2948: Attachment: LUCENE-2948_automaton.patch Nice work Mike, I think I found a bug with nextPossiblePrefix though? I attached my modifications to try to use this with Automaton (just the automaton parts). I'm also somehow triggering BlockReader's assert about crossing over index terms with other tests... I think I can see the problem here... is it that nextPossiblePrefix(BytesRef prefix) means it wants me to truly pass in a prefix? Obviously a consumer doesn't know which portion of his term is/isn't a prefix! So we would have to expose that :(, or alternatively change the semantics to nextPossiblePrefix(BytesRef term)? In other words, in this situation of 1[\u]234567891 it would simply return true, because it knows 1* exists rather than forwarding me to s? Maybe this is what was intended all along and it's just an off-by-one? {noformat} [junit] NOTE: reproduce with: ant test -Dtestcase=TestFuzzyQuery -Dtestmethod=testTokenLengthOpt -Dtests.seed=4471452442745287654:-2341611255635429887 -Dtests.codec=Standard // NOTE: this index has two terms: 12345678911 and segment [junit] - Standard Output --- [junit] candidate: [\u]1234567891 [junit] not found, goto: 1 [junit] candidate: 1[\u]234567891 [junit] not found, goto: s --- this is the problem, because 12345678911 exists [junit] candidate: s1234567891 [junit] found! [junit] candidate: t1234567891 [junit] found! {noformat} Make var gap terms index a partial prefix trie -- Key: LUCENE-2948 URL: https://issues.apache.org/jira/browse/LUCENE-2948 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2948.patch, LUCENE-2948.patch, LUCENE-2948_automaton.patch Var gap stores (in an FST) the indexed terms (every 32nd term, by default), minus their non-distinguishing suffixes. However, often times the resulting FST is close to a prefix trie in some portion of the terms space. By allowing some nodes of the FST to store all outgoing edges, including ones that do not lead to an indexed term, and by recording that this node is then authoritative as to what terms exist in the terms dict from that prefix, we can get some important benefits: * It becomes possible to know that a certain term prefix cannot exist in the terms index, which means we can save a disk seek in some cases (like PK lookup, docFreq, etc.) * We can query for the next possible prefix in the index, allowing some MTQs (eg FuzzyQuery) to save disk seeks. Basically, the terms index is able to answer questions that previously required seeking/scanning in the terms dict file. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: [Lucene.Net] how to add a new record to existing index
There are several things to consider. The first is what DIGY pointed out. The third parameter of the IndexWriter constructor determines if the code is creating a new index or opening an existing index for additions. The code must specify false to open an existing index for additions. A second thing to consider is that the additions made to the index with writer.AddDocument() will not be visible until the IndexWriter is closed, or the Commit() method is called. A third thing to consider: instances of IndexReader can only see the content of the index at the time the IndexReader instance was opened. Even after the IndexWriter commits its changes, IndexReader instances must be re-opened in order to see the new index content. It seems you should check your code to ensure: - IndexWriter constructor is being called with the right parameters to open an existing index. - IndexWriter is closed or commit is called after changes have been made. - IndexReader instances are re-opened after changes have been committed. - Neal -Original Message- From: Digy [mailto:digyd...@gmail.com] Sent: Thursday, March 03, 2011 12:16 PM To: lucene-net-...@lucene.apache.org Subject: RE: [Lucene.Net] how to add a new record to existing index I don't think that I understand your problem. Is it something like IndexWriter writer = new IndexWriter(path, analyzer, *false*, IndexWriter.DEFAULT_MAX_FIELD_LENGTH); .. writer.AddDocument(doc); DIGY -Original Message- From: Wen Gao [mailto:samuel.gao...@gmail.com] Sent: Thursday, March 03, 2011 1:43 AM To: lucene-net-...@lucene.apache.org Subject: Re: [Lucene.Net] how to add a new record to existing index Hi Digy, It was my fault that I didn't say it clearly. I mean I have created an index, but it is not updated in real time. So I want to update the index every time after I add data to the database, to keep the index up-to-date. My data is what the user inputs, which is then inserted into the database. BTW, I know how to delete a term from the index using IndexReader. Likewise, I want to write a term to the created index instead of creating a new index. I appreciate your time. Thanks, Wen 2011/3/2 Digy digyd...@gmail.com First of all, your code doesn't mean anything to me other than that you add some fields to a document object. Also, I can't see what you mean with *existing* index. The directory you pass to the IndexWriter is the index you use, and every document added (using IndexWriter's AddDocument) is written to that index. I think we have problems in using a common terminology. DIGY PS: It would be better if you used the user mailing list to ask questions. This mailing list is intended to be for development purposes. -Original Message- From: Wen Gao [mailto:samuel.gao...@gmail.com] Sent: Wednesday, March 02, 2011 11:02 PM To: lucene-net-...@lucene.apache.org Subject: [Lucene.Net] how to add a new record to existing index Hi, I have already created an index, and I want to insert an index record into this existing index every time I insert a new record into the database. For example, if I want to insert a record (l1, 15, tom, 20, 2010/01/02) into my *existing* index, how can I do this? (I don't want to create a new index, which takes too much time.) My format of index is as follows: /// doc.Add(new Lucene.Net.Documents.Field( "lmname", readerreader1["lmname"].ToString(), //new System.IO.StringReader(readerreader["cname"].ToString()), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.TOKENIZED) ); //lmid doc.Add(new Lucene.Net.Documents.Field( "lmid", readerreader1["lmid"].ToString(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.UN_TOKENIZED)); // nick name of user doc.Add(new Lucene.Net.Documents.Field( "nickName", readerreader1["nickName"].ToString(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.UN_TOKENIZED)); // uid doc.Add(new Lucene.Net.Documents.Field( "uid", readerreader1["uid"].ToString(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.UN_TOKENIZED)); // acttime doc.Add(new Lucene.Net.Documents.Field( "acttime", readerreader1["acttime"].ToString(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.UN_TOKENIZED)); writer.AddDocument(doc); /// Thanks, Wen
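Neal's checklist, sketched against the Java Lucene 3.x API (Lucene.Net mirrors it with .NET naming such as AddDocument/Commit); the path and field values below are made up:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class AppendToIndex {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(new File("/path/to/index"));

            // false = open the existing index for additions, do not recreate it
            IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30), false,
                IndexWriter.MaxFieldLength.UNLIMITED);
            Document doc = new Document();
            doc.add(new Field("lmid", "15", Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
            writer.commit(); // additions are invisible to readers until here

            // readers see the index as of the moment they were opened;
            // refresh after a commit to see the new document
            IndexReader reader = IndexReader.open(dir, true);
            IndexReader fresh = reader.reopen();
            if (fresh != reader) {
                reader.close();
                reader = fresh;
            }
            System.out.println(reader.numDocs());
            reader.close();
            writer.close();
        }
    }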
[jira] Commented: (SOLR-2399) Solr Admin Interface, reworked
[ https://issues.apache.org/jira/browse/SOLR-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002159#comment-13002159 ] Ryan McKinley commented on SOLR-2399: - Any thoughts on implementing with velocity templates? I don't want to slow this down since any effort is great! but long term, it would be great to drop JSP completely Solr Admin Interface, reworked -- Key: SOLR-2399 URL: https://issues.apache.org/jira/browse/SOLR-2399 Project: Solr Issue Type: Improvement Components: web gui Reporter: Stefan Matheis (steffkes) Priority: Minor *The idea was to create a new, fresh (and hopefully clean) Solr Admin Interface.* [Based on this [ML-Thread|http://www.lucidimagination.com/search/document/ae35e236d29d225e/solr_admin_interface_reworked_go_on_go_away]] I've quickly created a Github-Repository (Just for me, to keep track of the changes) » https://github.com/steffkes/solr-admin [This commit shows the differences|https://github.com/steffkes/solr-admin/commit/5f80bb0ea9deb4b94162632912fe63386f869e0d] between old/existing index.jsp and my new one (which is could copy-cut/paste'd from the existing one). Main Action takes place in [js/script.js|https://github.com/steffkes/solr-admin/blob/master/js/script.js] which is actually neither clean nor pretty .. just work-in-progress. Actually it's Work in Progress, so ... give it a try. It's developed with Firefox as Browser, so, for a first impression .. please don't use _things_ like Internet Explorer or so ;o Jan already suggested a bunch of good things, i'm sure there are more ideas over there :) -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2399) Solr Admin Interface, reworked
[ https://issues.apache.org/jira/browse/SOLR-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002182#comment-13002182 ] Stefan Matheis (steffkes) commented on SOLR-2399: - Ryan, actually not - but that is only based on the fact, that i've never worked with them. After a first look on http://velocity.apache.org/ there is no Getting Started-Noob-Stefan-Tutorial, no Getting Started at all ;o .. but i'll check this. Solr Admin Interface, reworked -- Key: SOLR-2399 URL: https://issues.apache.org/jira/browse/SOLR-2399 Project: Solr Issue Type: Improvement Components: web gui Reporter: Stefan Matheis (steffkes) Priority: Minor *The idea was to create a new, fresh (and hopefully clean) Solr Admin Interface.* [Based on this [ML-Thread|http://www.lucidimagination.com/search/document/ae35e236d29d225e/solr_admin_interface_reworked_go_on_go_away]] I've quickly created a Github-Repository (Just for me, to keep track of the changes) » https://github.com/steffkes/solr-admin [This commit shows the differences|https://github.com/steffkes/solr-admin/commit/5f80bb0ea9deb4b94162632912fe63386f869e0d] between old/existing index.jsp and my new one (which is could copy-cut/paste'd from the existing one). Main Action takes place in [js/script.js|https://github.com/steffkes/solr-admin/blob/master/js/script.js] which is actually neither clean nor pretty .. just work-in-progress. Actually it's Work in Progress, so ... give it a try. It's developed with Firefox as Browser, so, for a first impression .. please don't use _things_ like Internet Explorer or so ;o Jan already suggested a bunch of good things, i'm sure there are more ideas over there :) -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (SOLR-2399) Solr Admin Interface, reworked
[ https://issues.apache.org/jira/browse/SOLR-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002182#comment-13002182 ] Stefan Matheis (steffkes) edited comment on SOLR-2399 at 3/3/11 8:27 PM: - Ryan, actually not - but that is only based on the fact, that i've never worked with them. // -Edit After having a first look, i'm not really sure what/how that should help? the index.jsp is needed to gather Information about Cores [using org.apache.solr.core.CoreContainer] .. and i (just) don't see, if (and if so, how) it is possible to pass that information to the Velocity-Thingy. If that could be done .. no point about dropping that index.jsp out of order :) was (Author: steffkes): Ryan, actually not - but that is only based on the fact, that i've never worked with them. After a first look on http://velocity.apache.org/ there is no Getting Started-Noob-Stefan-Tutorial, no Getting Started at all ;o .. but i'll check this. Solr Admin Interface, reworked -- Key: SOLR-2399 URL: https://issues.apache.org/jira/browse/SOLR-2399 Project: Solr Issue Type: Improvement Components: web gui Reporter: Stefan Matheis (steffkes) Priority: Minor *The idea was to create a new, fresh (and hopefully clean) Solr Admin Interface.* [Based on this [ML-Thread|http://www.lucidimagination.com/search/document/ae35e236d29d225e/solr_admin_interface_reworked_go_on_go_away]] I've quickly created a Github-Repository (Just for me, to keep track of the changes) » https://github.com/steffkes/solr-admin [This commit shows the differences|https://github.com/steffkes/solr-admin/commit/5f80bb0ea9deb4b94162632912fe63386f869e0d] between old/existing index.jsp and my new one (which is could copy-cut/paste'd from the existing one). Main Action takes place in [js/script.js|https://github.com/steffkes/solr-admin/blob/master/js/script.js] which is actually neither clean nor pretty .. just work-in-progress. Actually it's Work in Progress, so ... give it a try. It's developed with Firefox as Browser, so, for a first impression .. please don't use _things_ like Internet Explorer or so ;o Jan already suggested a bunch of good things, i'm sure there are more ideas over there :) -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: issue with automatic iterable detection?
Andi Vajda va...@apache.org wrote: Bill, Did that solve your problem ? Haven't had a chance to try it yet. Will report back when I do. Bill Andi.. On Feb 28, 2011, at 20:05, Andi Vajda va...@apache.org wrote: On Sun, 27 Feb 2011, Bill Janssen wrote: Andi Vajda va...@apache.org wrote: It may be simplest if you can send me the source file for this class as well as a small jar file I can use to reproduce this ? Turns out to be simple to reproduce. Put the attached in a file called test.java, and run this sequence: % javac -classpath . test.java % jar cf test.jar *.class % python -m jcc.__main__ --python test --shared --jar /tmp/test.jar --build --vmarg -Djava.awt.headless=true This was a tougher one. It was triggered by a combination of things: - no wrapper requested for java.io.File or --package java.io - a subclass of a parameterized class or interface implementor of a parameterized interface wasn't pulling in classes used as type parameters (java.io.File here). A fix is checked into jcc trunk/branch_3x rev 1075642. This also includes the earlier fix about using absolute class names. Andi..
[jira] Created: (SOLR-2400) FieldAnalysisRequestHandler; add information about token-relation
FieldAnalysisRequestHandler; add information about token-relation - Key: SOLR-2400 URL: https://issues.apache.org/jira/browse/SOLR-2400 Project: Solr Issue Type: Improvement Components: Schema and Analysis Reporter: Stefan Matheis (steffkes) Priority: Minor Attachments: 110303_FieldAnalysisRequestHandler_output.xml The XML-Output (simplified example attached) is missing one small piece of information .. which could be very useful to build a nice Analysis-Output, and that's Token-Relation (if there is a special/correct word for this, please correct me). Meaning, that it is actually not possible to follow the Analysis-Process (completely) while the Tokenizers/Filters drop out Tokens (e.g. StopWord) or split them into multiple Tokens (e.g. WordDelimiter). Would it be possible to include this Information? If so, it would be possible to create an improved Analysis-Page for the new Solr Admin (SOLR-2399) - short scribble attached -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-2400) FieldAnalysisRequestHandler; add information about token-relation
[ https://issues.apache.org/jira/browse/SOLR-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Matheis (steffkes) updated SOLR-2400: Attachment: 110303_FieldAnalysisRequestHandler_output.xml FieldAnalysisRequestHandler; add information about token-relation - Key: SOLR-2400 URL: https://issues.apache.org/jira/browse/SOLR-2400 Project: Solr Issue Type: Improvement Components: Schema and Analysis Reporter: Stefan Matheis (steffkes) Priority: Minor Attachments: 110303_FieldAnalysisRequestHandler_output.xml The XML-Output (simplified example attached) is missing one small information .. which could be very useful to build an nice Analysis-Output, and that's Token-Relation (if there is special/correct word for this, please correct me). Meaning, that is actually not possible to follow the Analysis-Process (completly) while the Tokenizers/Filters will drop out Tokens (f.e. StopWord) or split it into multiple Tokens (f.e. WordDelimiter). Would it be possible to include this Information? If so, it would be possible to create an improved Analysis-Page for the new Solr Admin (SOLR-2399) - short scribble attached -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-2400) FieldAnalysisRequestHandler; add information about token-relation
[ https://issues.apache.org/jira/browse/SOLR-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Matheis (steffkes) updated SOLR-2400: Attachment: 110303_FieldAnalysisRequestHandler_view.png FieldAnalysisRequestHandler; add information about token-relation - Key: SOLR-2400 URL: https://issues.apache.org/jira/browse/SOLR-2400 Project: Solr Issue Type: Improvement Components: Schema and Analysis Reporter: Stefan Matheis (steffkes) Priority: Minor Attachments: 110303_FieldAnalysisRequestHandler_output.xml, 110303_FieldAnalysisRequestHandler_view.png The XML-Output (simplified example attached) is missing one small information .. which could be very useful to build an nice Analysis-Output, and that's Token-Relation (if there is special/correct word for this, please correct me). Meaning, that is actually not possible to follow the Analysis-Process (completly) while the Tokenizers/Filters will drop out Tokens (f.e. StopWord) or split it into multiple Tokens (f.e. WordDelimiter). Would it be possible to include this Information? If so, it would be possible to create an improved Analysis-Page for the new Solr Admin (SOLR-2399) - short scribble attached -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2822) TimeLimitingCollector starts thread in static {} with no way to stop them
[ https://issues.apache.org/jira/browse/LUCENE-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002260#comment-13002260 ] Robert Muir commented on LUCENE-2822: - bq. I think this is still the best variant, as both System.nanoTime() and currentTimeMillis use system calls that are really expensive. Sorry, it's too funny: playing with LUCENE-2948 I saw a big slowdown on Windows that Mike didn't see on Linux... finally tracked it down to an uncommented nanoTime :) TimeLimitingCollector starts thread in static {} with no way to stop them - Key: LUCENE-2822 URL: https://issues.apache.org/jira/browse/LUCENE-2822 Project: Lucene - Java Issue Type: Bug Reporter: Robert Muir See the comment in LuceneTestCase. If you even do Class.forName(TimeLimitingCollector) it starts up a thread in a static method, and there isn't a way to kill it. This is broken. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
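For context, the usual mitigations for the syscall cost are a shared background timer thread (the approach TimeLimitingCollector takes) or amortizing the clock reads; a sketch of the latter, with invented names:

{noformat}
// illustrative only: sample the clock once per 1024 hits instead of per hit
class AmortizedTimeout {
    static void collectAll(long timeoutNanos) {
        final long deadline = System.nanoTime() + timeoutNanos;
        int sinceCheck = 0;
        for (long hit = 0; hit < Long.MAX_VALUE; hit++) {
            // ... collect(hit) would go here ...
            if (++sinceCheck == 1024) { // one clock syscall per 1024 hits
                sinceCheck = 0;
                if (System.nanoTime() > deadline) {
                    return; // or throw a time-exceeded exception
                }
            }
        }
    }
}
{noformat}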
[jira] Commented: (LUCENE-1824) FastVectorHighlighter truncates words at beginning and end of fragments
[ https://issues.apache.org/jira/browse/LUCENE-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002289#comment-13002289 ] Mark Miller commented on LUCENE-1824: - This has 4 votes and 5 watchers - is it ready to go in? FastVectorHighlighter truncates words at beginning and end of fragments --- Key: LUCENE-1824 URL: https://issues.apache.org/jira/browse/LUCENE-1824 Project: Lucene - Java Issue Type: Improvement Components: contrib/highlighter Environment: any Reporter: Alex Vigdor Assignee: Koji Sekiguchi Priority: Minor Fix For: 4.0 Attachments: LUCENE-1824.patch FastVectorHighlighter does not take word boundaries into consideration when building fragments, so that in most cases the first and last word of a fragment are truncated. This makes the highlights less legible than they should be. I will attach a patch to BaseFragmentBuilder that resolves this by expanding the start and end boundaries of the fragment to the first whitespace character on either side of the fragment, or the beginning or end of the source text, whichever comes first. This significantly improves legibility, at the cost of returning a slightly larger number of characters than specified for the fragment size. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2400) FieldAnalysisRequestHandler; add information about token-relation
[ https://issues.apache.org/jira/browse/SOLR-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002292#comment-13002292 ] Uwe Schindler commented on SOLR-2400: - The position is used e.g. in analysis.jsp to do exactly what you want to have. It is the token position. If no broken TokenFilters are used that do not correctly modify the posIncr attribute, you can simply use it for alignment. FieldAnalysisRequestHandler; add information about token-relation - Key: SOLR-2400 URL: https://issues.apache.org/jira/browse/SOLR-2400 Project: Solr Issue Type: Improvement Components: Schema and Analysis Reporter: Stefan Matheis (steffkes) Priority: Minor Attachments: 110303_FieldAnalysisRequestHandler_output.xml, 110303_FieldAnalysisRequestHandler_view.png The XML-Output (simplified example attached) is missing one small information .. which could be very useful to build an nice Analysis-Output, and that's Token-Relation (if there is special/correct word for this, please correct me). Meaning, that is actually not possible to follow the Analysis-Process (completly) while the Tokenizers/Filters will drop out Tokens (f.e. StopWord) or split it into multiple Tokens (f.e. WordDelimiter). Would it be possible to include this Information? If so, it would be possible to create an improved Analysis-Page for the new Solr Admin (SOLR-2399) - short scribble attached -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
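A sketch of the alignment Uwe describes, written against the trunk-era attribute API; analyzer and text stand in for the caller's values:

{noformat}
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

class PositionDump {
    static void dump(Analyzer analyzer, String text) throws IOException {
        TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
        ts.reset();
        int position = -1;
        while (ts.incrementToken()) {
            position += posIncr.getPositionIncrement(); // a stopped word leaves a gap here
            System.out.println(position + ": " + term);
        }
        ts.end();
        ts.close();
    }
}
{noformat}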
[jira] Created: (LUCENE-2949) FastVectorHighlighter FieldTermStack could likely benefit from using TermVectorMapper
FastVectorHighlighter FieldTermStack could likely benefit from using TermVectorMapper - Key: LUCENE-2949 URL: https://issues.apache.org/jira/browse/LUCENE-2949 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.0.3, 4.0 Reporter: Grant Ingersoll Priority: Minor Fix For: 3.2, 4.0 Based on my reading of the FieldTermStack constructor that loads the vector from disk, we could probably save a bunch of time and memory by using the TermVectorMapper callback mechanism instead of materializing the full array of terms into memory and then throwing most of them out. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
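A hedged sketch of that callback idea; the mapper class is invented, and the two overridden methods follow the 3.x TermVectorMapper API as best I can tell:

{noformat}
import java.util.Set;
import org.apache.lucene.index.TermVectorMapper;
import org.apache.lucene.index.TermVectorOffsetInfo;

// keep only the terms the highlighter actually needs, instead of first
// materializing the whole vector and throwing most of it away
class SelectiveMapper extends TermVectorMapper {
    private final Set<String> wanted;
    SelectiveMapper(Set<String> wanted) { this.wanted = wanted; }

    @Override
    public void setExpectations(String field, int numTerms,
                                boolean storeOffsets, boolean storePositions) {
        // could bail out early here when offsets/positions are absent
    }

    @Override
    public void map(String term, int frequency,
                    TermVectorOffsetInfo[] offsets, int[] positions) {
        if (wanted.contains(term)) {
            // push positions/offsets onto the FieldTermStack here
        }
    }
}
// usage: reader.getTermFreqVector(docId, "content", new SelectiveMapper(queryTerms));
{noformat}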
[jira] Commented: (LUCENE-1824) FastVectorHighlighter truncates words at beginning and end of fragments
[ https://issues.apache.org/jira/browse/LUCENE-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002299#comment-13002299 ] Robert Muir commented on LUCENE-1824: - just an idea: it seems like using a breakiterator would be the way to go here. FastVectorHighlighter truncates words at beginning and end of fragments --- Key: LUCENE-1824 URL: https://issues.apache.org/jira/browse/LUCENE-1824 Project: Lucene - Java Issue Type: Improvement Components: contrib/highlighter Environment: any Reporter: Alex Vigdor Assignee: Koji Sekiguchi Priority: Minor Fix For: 4.0 Attachments: LUCENE-1824.patch FastVectorHighlighter does not take word boundaries into consideration when building fragments, so that in most cases the first and last word of a fragment are truncated. This makes the highlights less legible than they should be. I will attach a patch to BaseFragmentBuilder that resolves this by expanding the start and end boundaries of the fragment to the first whitespace character on either side of the fragment, or the beginning or end of the source text, whichever comes first. This significantly improves legibility, at the cost of returning a slightly larger number of characters than specified for the fragment size. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
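A sketch of the BreakIterator idea using only the JDK; the helper name is made up:

{noformat}
import java.text.BreakIterator;

class FragmentBounds {
    // widen [start, end) to the nearest word boundaries on either side
    static int[] snapToWords(String text, int start, int end) {
        BreakIterator words = BreakIterator.getWordInstance();
        words.setText(text);
        int s = words.isBoundary(start) ? start : words.preceding(start);
        int e = words.isBoundary(end) ? end : words.following(end);
        if (s == BreakIterator.DONE) s = 0;
        if (e == BreakIterator.DONE) e = text.length();
        return new int[] { s, e };
    }
}
{noformat}

Unlike a raw whitespace scan, this also snaps sensibly at punctuation and locale-specific boundaries.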
[jira] Commented: (SOLR-2399) Solr Admin Interface, reworked
[ https://issues.apache.org/jira/browse/SOLR-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002310#comment-13002310 ] Hoss Man commented on SOLR-2399: bq. the index.jsp is needed to gather Information about Cores [using org.apache.solr.core.CoreContainer] This is where the core admin handler should be useful -- you can use it to get a list of cores and their statuses. In the example solr.xml (and by default if no solr.xml exists) it's available at /admin/cores but that can be changed -- for now your JSP should be able to ask the CoreContainer for it using getAdminPath() (If it would be useful, we could also add a simple bit of info to the SystemInfoRequestHandler (/_corename_/admin/system) output to let the UI (and external clients) know what path (if any) they can use to access the CoreAdminHandler if all they have is the URL for a single core.) bq. I don't want to slow this down since any effort is great! but long term, it would be great to drop JSP completely i agree it would be nice to show off using the velocity writer to style handler responses in the admin ui, but i think that the general approach of using a jsp (or servlet) as the master controller for creating a base HTML page that then uses javascript to query all of the individual handler APIs makes a lot of sense -- if for no other reason than that i don't think the velocity writer could really be used on the output of the CoreAdminHandler (can it? .. what context would it load the templates from?) Ultimately the problem we're always going to run into is that people can customize the paths of things in their configs - not just CoreAdminHandler but even all of the various core specific admin handlers. I don't think that's something we really have to be worried about right now (the existing admin UI certainly doesn't) but using a simple servlet/index.jsp gives us the ability to at least start with a direct java call to answer the question: what is the url of the coreadmin handler? and then from there everything can be dynamically driven. If the logic in the JSP is simple enough, and the real work is done in the javascript, then porting that JSP to velocity should ultimately be pretty straightforward (if there is a strong desire) Solr Admin Interface, reworked -- Key: SOLR-2399 URL: https://issues.apache.org/jira/browse/SOLR-2399 Project: Solr Issue Type: Improvement Components: web gui Reporter: Stefan Matheis (steffkes) Priority: Minor *The idea was to create a new, fresh (and hopefully clean) Solr Admin Interface.* [Based on this [ML-Thread|http://www.lucidimagination.com/search/document/ae35e236d29d225e/solr_admin_interface_reworked_go_on_go_away]] I've quickly created a Github-Repository (Just for me, to keep track of the changes) » https://github.com/steffkes/solr-admin [This commit shows the differences|https://github.com/steffkes/solr-admin/commit/5f80bb0ea9deb4b94162632912fe63386f869e0d] between old/existing index.jsp and my new one (which is could copy-cut/paste'd from the existing one). Main Action takes place in [js/script.js|https://github.com/steffkes/solr-admin/blob/master/js/script.js] which is actually neither clean nor pretty .. just work-in-progress. Actually it's Work in Progress, so ... give it a try. It's developed with Firefox as Browser, so, for a first impression .. please don't use _things_ like Internet Explorer or so ;o Jan already suggested a bunch of good things, i'm sure there are more ideas over there :) -- This message is automatically generated by JIRA.
- For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
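For illustration only, a hypothetical fragment of such a master controller; getAdminPath() is the call Hoss names, while the request-attribute key and variable names below are guesses:

{noformat}
<%
// hypothetical JSP scriptlet -- the attribute key is a guess, not a documented contract
org.apache.solr.core.CoreContainer cores =
    (org.apache.solr.core.CoreContainer) request.getAttribute("org.apache.solr.CoreContainer");
String adminPath = (cores == null) ? "/admin/cores" : cores.getAdminPath();
// hand the path to the client-side JavaScript that drives the rest of the UI
out.println("<script>var solrAdminPath = '" + adminPath + "';</script>");
%>
{noformat}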
[jira] Commented: (SOLR-2400) FieldAnalysisRequestHandler; add information about token-relation
[ https://issues.apache.org/jira/browse/SOLR-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002313#comment-13002313 ] Stefan Matheis (steffkes) commented on SOLR-2400: - Uwe, that was the first thing I thought myself, yes - but .. let's take flat (starting at position 4) and follow it. Passing StopFilter, still position 4; arriving at WordDelimiter, it's position 6 - the dash was dropped for being a StopWord and VA902B got split up into three Tokens. So what I guess is missing .. is some type of information that, for example, the original Token at position 2 (VA902B) is split and now (partially) placed at positions 3 through 6 .. also, for example, that flat is no longer at position 4, because it moved to 6. Or did I just miss something really simple? FieldAnalysisRequestHandler; add information about token-relation - Key: SOLR-2400 URL: https://issues.apache.org/jira/browse/SOLR-2400 Project: Solr Issue Type: Improvement Components: Schema and Analysis Reporter: Stefan Matheis (steffkes) Priority: Minor Attachments: 110303_FieldAnalysisRequestHandler_output.xml, 110303_FieldAnalysisRequestHandler_view.png The XML-Output (simplified example attached) is missing one small information .. which could be very useful to build an nice Analysis-Output, and that's Token-Relation (if there is special/correct word for this, please correct me). Meaning, that is actually not possible to follow the Analysis-Process (completly) while the Tokenizers/Filters will drop out Tokens (f.e. StopWord) or split it into multiple Tokens (f.e. WordDelimiter). Would it be possible to include this Information? If so, it would be possible to create an improved Analysis-Page for the new Solr Admin (SOLR-2399) - short scribble attached -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: issue with automatic iterable detection?
On Thu, 3 Mar 2011, Andi Vajda wrote: Indeed, this is why I put that assertion there :-) It's a bit of guesswork what all the possibilities are there. I'll add support for arrays there. Fix is checked into rev 1076883. Back to you, Bill. Thanks ! Andi.. Andi.. On Thu, 3 Mar 2011, Bill Janssen wrote: This looks like a problem. This is with an svn checkout of branch_3x. Bill 122, in _run_module_as_main __main__, fname, loader, pkg_name) File /usr/lib/python2.6/runpy.py, line 34, in _run_code exec code in run_globals File /usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/__main__.py, line 98, in module cpp.jcc(sys.argv) File /usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py, line 548, in jcc addRequiredTypes(cls, typeset, generics) File /usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py, line 233, in addRequiredTypes addRequiredTypes(cls, typeset, True) File /usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py, line 238, in addRequiredTypes addRequiredTypes(ta, typeset, True) File /usr/local/lib/python2.6/dist-packages/JCC-2.7-py2.6-linux-x86_64.egg/jcc/cpp.py, line 240, in addRequiredTypes raise NotImplementedError, repr(cls) NotImplementedError: Type: double[] %
[jira] Commented: (LUCENE-2939) Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream
[ https://issues.apache.org/jira/browse/LUCENE-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002334#comment-13002334 ] Mark Miller commented on LUCENE-2939: - My last patch is missing a couple required test compile changes - I excluded that class cause I had some test code in it. I'll put up a new patch as soon as I get a chance with the test class changes (Scorer init method gets a new param and there are a couple anonymous impls in test) Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream Key: LUCENE-2939 URL: https://issues.apache.org/jira/browse/LUCENE-2939 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Reporter: Mark Miller Assignee: Mark Miller Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-2939.patch, LUCENE-2939.patch huge documents can be drastically slower than need be because the entire field is added to the memory index this cost can be greatly reduced in many cases if we try and respect maxDocCharsToAnalyze things can be improved even further by respecting this setting with CachingTokenStream -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2939) Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream
[ https://issues.apache.org/jira/browse/LUCENE-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002335#comment-13002335 ] Mark Miller commented on LUCENE-2939: - Honestly, if I was not so busy, I'd say we should really get this in for 3.1. If you are doing something like desktop search, this can be a really cruel highlighter perf problem. Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream Key: LUCENE-2939 URL: https://issues.apache.org/jira/browse/LUCENE-2939 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Reporter: Mark Miller Assignee: Mark Miller Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-2939.patch, LUCENE-2939.patch huge documents can be drastically slower than need be because the entire field is added to the memory index this cost can be greatly reduced in many cases if we try and respect maxDocCharsToAnalyze things can be improved even further by respecting this setting with CachingTokenStream -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2939) Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream
[ https://issues.apache.org/jira/browse/LUCENE-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002340#comment-13002340 ] Mark Miller commented on LUCENE-2939: - P.S. One that is really a bad bug in my mind - we switched this to be the default and the old Highlighter did not suffer like this in these situations. Looking back over the email archives, it bit more than a few people. I'm pretty sure this bug was the impetus of the Fast Vector Highlighter (which is still valuable if you *really* do want to highlight over every token in your 3 billion word PDF file ;) ). You pay this huge perf penalty for no gain and no reason. If you are talking wikipedia size docs, it won't affect you - but for long documents, doing 10 snippets can be prohibitive, with no workaround. That is not a friendly neighborhood highlighter. Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream Key: LUCENE-2939 URL: https://issues.apache.org/jira/browse/LUCENE-2939 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Reporter: Mark Miller Assignee: Mark Miller Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-2939.patch, LUCENE-2939.patch huge documents can be drastically slower than need be because the entire field is added to the memory index this cost can be greatly reduced in many cases if we try and respect maxDocCharsToAnalyze things can be improved even further by respecting this setting with CachingTokenStream -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2399) Solr Admin Interface, reworked
[ https://issues.apache.org/jira/browse/SOLR-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002381#comment-13002381 ] Ryan McKinley commented on SOLR-2399: - bq. If the logic in the JSP is simple enough, and the real work is done in the javascript, then porting that JSP to velocity should ultimately be pretty straight forward (if there is a strong desire) Yes, if anyone is willing to give the admin pages some much needed design love, I really don't want anything to slow that down. In the future, if there is interest, it would be great to do this w/o JSP; the details of how will take some work. Solr Admin Interface, reworked -- Key: SOLR-2399 URL: https://issues.apache.org/jira/browse/SOLR-2399 Project: Solr Issue Type: Improvement Components: web gui Reporter: Stefan Matheis (steffkes) Priority: Minor *The idea was to create a new, fresh (and hopefully clean) Solr Admin Interface.* [Based on this [ML-Thread|http://www.lucidimagination.com/search/document/ae35e236d29d225e/solr_admin_interface_reworked_go_on_go_away]] I've quickly created a Github-Repository (Just for me, to keep track of the changes) » https://github.com/steffkes/solr-admin [This commit shows the differences|https://github.com/steffkes/solr-admin/commit/5f80bb0ea9deb4b94162632912fe63386f869e0d] between old/existing index.jsp and my new one (which is could copy-cut/paste'd from the existing one). Main Action takes place in [js/script.js|https://github.com/steffkes/solr-admin/blob/master/js/script.js] which is actually neither clean nor pretty .. just work-in-progress. Actually it's Work in Progress, so ... give it a try. It's developed with Firefox as Browser, so, for a first impression .. please don't use _things_ like Internet Explorer or so ;o Jan already suggested a bunch of good things, i'm sure there are more ideas over there :) -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2939) Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream
[ https://issues.apache.org/jira/browse/LUCENE-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002394#comment-13002394 ] Grant Ingersoll commented on LUCENE-2939: - I can backport if you want. Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream Key: LUCENE-2939 URL: https://issues.apache.org/jira/browse/LUCENE-2939 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Reporter: Mark Miller Assignee: Mark Miller Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-2939.patch, LUCENE-2939.patch huge documents can be drastically slower than need be because the entire field is added to the memory index this cost can be greatly reduced in many cases if we try and respect maxDocCharsToAnalyze things can be improved even further by respecting this setting with CachingTokenStream -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[HUDSON] Lucene-Solr-tests-only-trunk - Build # 5555 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk// 1 tests failed. REGRESSION: org.apache.lucene.index.TestIndexReaderReopen.testThreadSafety Error Message: Error occurred in thread Thread-63: /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/lucene/build/test/8/test6361311639913063277tmp/_a_1.doc (Too many open files in system) Stack Trace: junit.framework.AssertionFailedError: Error occurred in thread Thread-63: /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/lucene/build/test/8/test6361311639913063277tmp/_a_1.doc (Too many open files in system) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1213) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1145) /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/lucene/build/test/8/test6361311639913063277tmp/_a_1.doc (Too many open files in system) at org.apache.lucene.index.TestIndexReaderReopen.testThreadSafety(TestIndexReaderReopen.java:833) Build Log (for compile errors): [...truncated 3110 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2939) Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream
[ https://issues.apache.org/jira/browse/LUCENE-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002400#comment-13002400 ] Mark Miller commented on LUCENE-2939: - bq. i think the offsetLength calculation needs to be inside the incrementToken? I do not follow ... incrementToken is: + @Override + public boolean incrementToken() throws IOException { +int offsetLength = offsetAttrib.endOffset() - offsetAttrib.startOffset(); +if (offsetCount < offsetLimit && input.incrementToken()) { + offsetCount += offsetLength; + return true; +} +return false; + } Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream Key: LUCENE-2939 URL: https://issues.apache.org/jira/browse/LUCENE-2939 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Reporter: Mark Miller Assignee: Mark Miller Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-2939.patch, LUCENE-2939.patch huge documents can be drastically slower than need be because the entire field is added to the memory index this cost can be greatly reduced in many cases if we try and respect maxDocCharsToAnalyze things can be improved even further by respecting this setting with CachingTokenStream -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2939) Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream
[ https://issues.apache.org/jira/browse/LUCENE-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002402#comment-13002402 ]

Robert Muir commented on LUCENE-2939:
-------------------------------------

Exactly, so what are the attribute values before calling input.incrementToken()? I don't think it is good practice to work with uninitialized values.
[jira] Updated: (LUCENE-2939) Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream
[ https://issues.apache.org/jira/browse/LUCENE-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-2939:
--------------------------------

Attachment: LUCENE-2939.patch

This includes the change to the test to make it compile. Still no CHANGES entry. The compile change to the test is a back-compat break. The Scorer needs to know the maxCharsToAnalyze setting. Have not had time to consider further yet.
[jira] Commented: (SOLR-1431) CommComponent abstracted
[ https://issues.apache.org/jira/browse/SOLR-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002410#comment-13002410 ]

Jason Rutherglen commented on SOLR-1431:
----------------------------------------

What's the status of this one?

> CommComponent abstracted
>
>              Key: SOLR-1431
>              URL: https://issues.apache.org/jira/browse/SOLR-1431
>          Project: Solr
>       Issue Type: Improvement
>       Components: search
> Affects Versions: 1.4
>         Reporter: Jason Rutherglen
>         Assignee: Noble Paul
>         Priority: Trivial
>          Fix For: Next
>      Attachments: SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> We'll abstract CommComponent in this issue.
[jira] Commented: (SOLR-1395) Integrate Katta
[ https://issues.apache.org/jira/browse/SOLR-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002432#comment-13002432 ]

JohnWu commented on SOLR-1395:
------------------------------

OK, we now get the correct result back (from slave02 to the master):

    <result name="response" numFound="1" start="0">
      <doc>
        <str name="id">MA147LL/A</str>
        <str name="name">Apple 60 GB iPod with Video Playback Black</str>
        <str name="manu">Apple Computer Inc.</str>
        ...

Note: if you use Tomliu's patch, please correct the code in QueryComponent:

    // JohnWu: corrected to ||; we need to decide whether shards is null
    if (shards == null) {
      hasShardURL = false;
    } else {
      hasShardURL = shards != null || shards.indexOf('/') > 0;
    }

so that queryCore can enter the distributed process and get the hits, and the DocSlice can be cast to a DocumentList. If you have any problem, please ask me and we can discuss it together. Thanks!

JohnWu

> Integrate Katta
> ---------------
>              Key: SOLR-1395
>              URL: https://issues.apache.org/jira/browse/SOLR-1395
>          Project: Solr
>       Issue Type: New Feature
> Affects Versions: 1.4
>         Reporter: Jason Rutherglen
>         Priority: Minor
>          Fix For: Next
>      Attachments: SOLR-1395.patch, SOLR-1395.patch, SOLR-1395.patch, back-end.log, front-end.log, hadoop-core-0.19.0.jar, katta-core-0.6-dev.jar, katta-solrcores.jpg, katta.node.properties, katta.zk.properties, log4j-1.2.13.jar, solr-1395-1431-3.patch, solr-1395-1431-4.patch, solr-1395-1431-katta0.6.patch, solr-1395-1431-katta0.6.patch, solr-1395-1431.patch, solr-1395-katta-0.6.2-1.patch, solr-1395-katta-0.6.2-2.patch, solr-1395-katta-0.6.2-3.patch, solr-1395-katta-0.6.2.patch, test-katta-core-0.6-dev.jar, zkclient-0.1-dev.jar, zookeeper-3.2.1.jar
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> We'll integrate Katta into Solr so that:
> * Distributed search uses Hadoop RPC
> * Shard/SolrCore distribution and management
> * Zookeeper based failover
> * Indexes may be built using Hadoop
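A side note on JohnWu's corrected guard above: inside the else branch, shards is already known to be non-null, so with the || change the indexOf test is never reached and the guard reduces to a plain null check. A minimal restatement of that reading follows; the helper name is hypothetical and not from any patch.

    // Hypothetical helper equivalent to the corrected guard above: the
    // || short-circuits on shards != null, which always holds inside
    // the else branch, so any non-null shards parameter enters the
    // distributed code path.
    static boolean hasShardURL(String shards) {
      if (shards == null) {
        return false;
      }
      return shards != null || shards.indexOf('/') > 0; // always true here
    }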
[jira] Commented: (LUCENE-2939) Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream
[ https://issues.apache.org/jira/browse/LUCENE-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002433#comment-13002433 ]

Robert Muir commented on LUCENE-2939:
-------------------------------------

{quote}
I see what you mean now - though I still don't understand your previous comment. I assume that it's just defaulting to 0 - 0 now?
{quote}

Only the first time. But imagine you try to reuse this tokenstream (maybe it's not being reused now, but it could be in the future)... the values for the last token of the previous doc are, say, 10 - 5. The consumer calls reset(Reader) with the new document and then reset(), which clears your accumulator, but this attribute still holds 10 - 5 until input.incrementToken(); only then does the tokenizer update the values.
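To make the concern concrete, here is a minimal sketch of the filter with the length read only after input.incrementToken() has filled in the current token's attributes, and with the accumulator cleared on reset() so stream reuse stays safe. Class and field names are illustrative, not necessarily those of the committed patch.

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    // Sketch: stop returning tokens once the accumulated character
    // offsets pass offsetLimit (e.g. maxDocCharsToAnalyze).
    public final class OffsetLimitFilter extends TokenFilter {
      private final OffsetAttribute offsetAttrib = addAttribute(OffsetAttribute.class);
      private final int offsetLimit;
      private int offsetCount;

      public OffsetLimitFilter(TokenStream input, int offsetLimit) {
        super(input);
        this.offsetLimit = offsetLimit;
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (offsetCount < offsetLimit && input.incrementToken()) {
          // The attributes are valid only now, after incrementToken().
          offsetCount += offsetAttrib.endOffset() - offsetAttrib.startOffset();
          return true;
        }
        return false;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        offsetCount = 0; // otherwise reuse would carry over the old count
      }
    }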
[jira] Commented: (SOLR-2329) old index files not deleted on slave
[ https://issues.apache.org/jira/browse/SOLR-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002439#comment-13002439 ]

Ryosuke Fujita commented on SOLR-2329:
--------------------------------------

I had a similar problem, but once I gave the solr user modify/write permission, the old index files vanished. My OS is Windows Server 2008, though, and yours is CentOS, so it may not be related. Who invokes the replication task?

> old index files not deleted on slave
>
>          Key: SOLR-2329
>          URL: https://issues.apache.org/jira/browse/SOLR-2329
>      Project: Solr
>   Issue Type: Bug
>   Components: replication (java)
> Affects Versions: 4.0
>  Environment: centos 5.5, ext3 file system
>     Reporter: Edwin Khodabakchian
>  Attachments: solrconfig.xml, solrconfig_slave.xml
>
> I have set up index replication (triggered on optimize). The problem I am having is that the old index files are not being deleted on the slave. After each replication, I can see the old files still hanging around as well as the files that have just been pulled. This causes the data directory size to increase by the index size on every replication until the disk fills up. I am running 4.0 rev 993367 with patch SOLR-1316. Otherwise, the setup is pretty vanilla. I can reproduce this on multiple slaves. Checking the logs, I see the following error:
>
> SEVERE: SnapPull failed
> org.apache.solr.common.SolrException: Index fetch failed :
>     at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:329)
>     at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:265)
>     at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>     at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:619)
> Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/var/solrhome/data/index/lucene-cdaa80c0fefe1a7dfc7aab89298c614c-write.lock
>     at org.apache.lucene.store.Lock.obtain(Lock.java:84)
>     at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1065)
>     at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:954)
>     at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:192)
>     at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:99)
>     at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
>     at org.apache.solr.update.DirectUpdateHandler2.forceOpenWriter(DirectUpdateHandler2.java:376)
>     at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:471)
>     at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:319)
>     ... 11 more
>
> lsof reveals that the file is still opened from the java process.
Contents of the index data dir:

master:
-rw-rw-r-- 1 feeddo feeddo  191 Dec 14 01:06 _1lg.fnm
-rw-rw-r-- 1 feeddo feeddo  26M Dec 14 01:07 _1lg.fdx
-rw-rw-r-- 1 feeddo feeddo 1.9G Dec 14 01:07 _1lg.fdt
-rw-rw-r-- 1 feeddo feeddo 474M Dec 14 01:12 _1lg.tis
-rw-rw-r-- 1 feeddo feeddo  15M Dec 14 01:12 _1lg.tii
-rw-rw-r-- 1 feeddo feeddo 144M Dec 14 01:12 _1lg.prx
-rw-rw-r-- 1 feeddo feeddo 277M Dec 14 01:12 _1lg.frq
-rw-rw-r-- 1 feeddo feeddo  311 Dec 14 01:12 segments_1ji
-rw-rw-r-- 1 feeddo feeddo  23M Dec 14 01:12 _1lg.nrm
-rw-rw-r-- 1 feeddo feeddo  191 Dec 18 01:11 _24e.fnm
-rw-rw-r-- 1 feeddo feeddo  26M Dec 18 01:12 _24e.fdx
-rw-rw-r-- 1 feeddo feeddo 1.9G Dec 18 01:12 _24e.fdt
-rw-rw-r-- 1 feeddo feeddo 483M Dec 18 01:23 _24e.tis
-rw-rw-r-- 1 feeddo feeddo  15M Dec 18 01:23 _24e.tii
-rw-rw-r-- 1 feeddo feeddo 146M Dec 18 01:23 _24e.prx
-rw-rw-r-- 1 feeddo feeddo 283M Dec 18 01:23 _24e.frq
-rw-rw-r-- 1 feeddo feeddo  311 Dec 18 01:24 segments_1xz
-rw-rw-r-- 1 feeddo feeddo  23M Dec 18 01:24 _24e.nrm
-rw-rw-r-- 1 feeddo feeddo  191 Dec 18 13:15 _25z.fnm
-rw-rw-r-- 1 feeddo feeddo  26M Dec 18 13:16 _25z.fdx
-rw-rw-r-- 1 feeddo feeddo 1.9G Dec 18 13:16 _25z.fdt
-rw-rw-r-- 1 feeddo feeddo 484M Dec 18 13:35 _25z.tis
-rw-rw-r-- 1 feeddo feeddo  15M Dec 18
[jira] Commented: (SOLR-2400) FieldAnalysisRequestHandler; add information about token-relation
[ https://issues.apache.org/jira/browse/SOLR-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002491#comment-13002491 ]

Uwe Schindler commented on SOLR-2400:
-------------------------------------

Stefan, this is a general issue with TokenStreams that add tokens. TokenStreams that remove tokens *should* automatically preserve positions, but not even all of those do that correctly (we were fixing some of them lately). The way Lucene analysis works makes it impossible to guarantee any correspondence of the position numbers: for the indexer, only what comes out at the end matters, so the steps in between cannot be reconstructed. The FieldAnalysisRequestHandler, on the other hand, does some bad hacks to look inside the analysis (by using temporary TokenStreams that buffer tokens), which is not the general use case of TokenStreams.

I wonder a little bit about your XML file: it only contains text and position, but it should also contain rawTerm, startOffset, and endOffset. When I call analysis I get all of those attributes, not only two of them. Is this a hand-made file, or what is the problem? Which Solr version?

One possibility to handle this might be the character offsets into the original text, because those should point to the begin and end of the token in the original stream rather than the token position. But this is likely to break for lots of TokenFilters (WordDelimiterFilter would work as long as you don't do stemming before...). The problem is incorrect handling of offset calculation (also leading to bugs in highlighting) when the inserted terms are longer than their originals. Altogether: it's unlikely that you can implement this in a way that works for all combinations of TokenStream components.

> FieldAnalysisRequestHandler; add information about token-relation
> ------------------------------------------------------------------
>          Key: SOLR-2400
>          URL: https://issues.apache.org/jira/browse/SOLR-2400
>      Project: Solr
>   Issue Type: Improvement
>   Components: Schema and Analysis
>     Reporter: Stefan Matheis (steffkes)
>     Priority: Minor
>  Attachments: 110303_FieldAnalysisRequestHandler_output.xml, 110303_FieldAnalysisRequestHandler_view.png
>
> The XML output (simplified example attached) is missing one small piece of information which could be very useful for building a nice analysis output: the token relation (if there is a special/correct word for this, please correct me). That is, it is currently not possible to follow the analysis process completely when tokenizers/filters drop tokens (e.g. stop words) or split one token into multiple tokens (e.g. WordDelimiter). Would it be possible to include this information? If so, an improved analysis page could be built for the new Solr admin UI (SOLR-2399); a short scribble is attached.
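As a concrete illustration of "TokenStreams that remove tokens should preserve positions": a minimal sketch of a token-dropping filter that folds the position increments of removed tokens into the next emitted token, so downstream consumers still see the holes. The class name and the length threshold are made up for the example.

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    // Sketch: drops tokens shorter than 3 chars while keeping the
    // position numbering of the surviving tokens intact.
    public final class DropShortTokensFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final PositionIncrementAttribute posIncrAtt =
          addAttribute(PositionIncrementAttribute.class);

      public DropShortTokensFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        int skipped = 0;
        while (input.incrementToken()) {
          if (termAtt.length() >= 3) {
            // Fold the dropped tokens' increments into this token.
            posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skipped);
            return true;
          }
          skipped += posIncrAtt.getPositionIncrement(); // account for the hole
        }
        return false;
      }
    }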