[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
[ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173941#comment-13173941 ]

Dawid Weiss commented on LUCENE-3654:
-------------------------------------

There's been an interesting discussion about the common use of Unsafe on the hotspot mailing list recently (can't recall the thread now, though). Some people even wanted Unsafe to become part of the standard library (not the unsafe accesses -- the lock checking part, but nonetheless). This guy wrote an entire off-heap collections library on top of Unsafe: http://www.ohloh.net/p/java-huge-collections

I think using Unsafe with a fallback is fine, especially in small-scope methods that are used frequently and can be thoroughly tested. BytesRef is such an example to me. That said, it would certainly help to convince Robert and others if you ran benchmarks with and without Unsafe and showed how much there is to gain, Shay.

> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
> Inspired by Google Guava's UnsignedBytes lexicographical comparator, which uses Unsafe to do long-based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in the BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
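[Editor's note] The long-at-a-time trick Dawid refers to can be sketched without Unsafe at all; this is a hypothetical illustration (class name invented, not Lucene or Guava code), using the fact that for big-endian packing, unsigned long order equals lexicographic unsigned byte order:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LongwiseCompareSketch {
    // Lexicographic unsigned byte comparison, 8 bytes per step where possible.
    public static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        ByteBuffer ba = ByteBuffer.wrap(a).order(ByteOrder.BIG_ENDIAN);
        ByteBuffer bb = ByteBuffer.wrap(b).order(ByteOrder.BIG_ENDIAN);
        int i = 0;
        // wide stride: compare 8 bytes at a time as unsigned big-endian longs
        for (; i + 8 <= n; i += 8) {
            long la = ba.getLong(i), lb = bb.getLong(i);
            if (la != lb) {
                return Long.compareUnsigned(la, lb) < 0 ? -1 : 1;
            }
        }
        // tail: remaining bytes one at a time, unsigned
        for (; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    }
}
```

Guava's variant reads native-order longs via Unsafe and locates the differing byte with bit tricks, which is faster still; the sketch above only shows why comparing packed longs preserves byte order.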
[jira] [Commented] (LUCENE-3662) extend LevenshteinAutomata to support transpositions as primitive edits
[ https://issues.apache.org/jira/browse/LUCENE-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173942#comment-13173942 ]

Dawid Weiss commented on LUCENE-3662:
-------------------------------------

Avanti Robert! :)

> extend LevenshteinAutomata to support transpositions as primitive edits
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-3662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3662
>             Project: Lucene - Java
>          Issue Type: New Feature
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-3662.patch, LUCENE-3662.patch, LUCENE-3662_upgrade_moman.patch, lev1.rev115.txt, lev1.rev119.txt, lev1t.txt, update-moman.patch
>
> This would be a nice improvement for spell correction: currently a transposition counts as 2 edits, which means users of DirectSpellChecker must use larger values of n (e.g. 2 instead of 1) and larger priority queue sizes, plus some sort of re-ranking with another distance measure for good results. Instead, if we can integrate chapter 7 of http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 then you can just build an alternative DFA where a transposition is only a single edit (http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance). According to the benchmarks in the original paper, the performance of LevT looks to be very similar to Lev. Support for this is now in moman (https://bitbucket.org/jpbarrette/moman/) thanks to Jean-Philippe Barrette-LaPierre.
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 11853 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/11853/

1 tests failed.

REGRESSION:  org.apache.lucene.index.TestIndexWriter.testThreadInterruptDeadlock

Error Message:
null

Stack Trace:
junit.framework.AssertionFailedError:
	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165)
	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57)
	at org.apache.lucene.index.TestIndexWriter.testThreadInterruptDeadlock(TestIndexWriter.java:1270)
	at org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:528)

Build Log (for compile errors):
[...truncated 1335 lines...]
[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
[ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173954#comment-13173954 ]

Uwe Schindler commented on LUCENE-3654:
---------------------------------------

I agree here, but before doing this, I want some non-micro-benchmarks to show the effect. If there is no real effect, don't do it. Inside Lucene the comparator is not used very often (mostly only in the indexer/BytesRefHash and in TermRangeQuery). The other use cases are asserts all over the place, but they don't count.

I would agree to the patch if the class were renamed to something like UnsignedBytesComparator and the part importing sun.misc.Unsafe were moved outside the main compilation unit. That way, if somebody compiles with a strange JVM like Harmony (although it's dead) where sun.misc.Unsafe is not available, the build still succeeds. The code in BytesRef uses reflection to load the comparator implementation, so all is fine: it would just get a ClassNotFoundException and fall back to the Java one. I could help with the ANT magic.
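[Editor's note] The reflection-plus-fallback loading strategy Uwe describes can be sketched roughly like this (class and package names are hypothetical, not the real Lucene code): attempt to load the optional Unsafe-backed implementation by name, and fall back to a plain Java comparator if the class cannot be loaded.

```java
import java.util.Comparator;

public class ComparatorLoader {
    // name of the optional implementation; hypothetical, for illustration only
    static final String OPTIONAL_IMPL = "org.example.UnsignedBytesComparator";

    @SuppressWarnings("unchecked")
    public static Comparator<byte[]> load() {
        try {
            Class<?> clazz = Class.forName(OPTIONAL_IMPL);
            return (Comparator<byte[]>) clazz.getConstructor().newInstance();
        } catch (ReflectiveOperationException | LinkageError e) {
            // class missing or unloadable (e.g. no sun.misc.Unsafe on this JVM):
            // fall back to the safe, pure-Java implementation
            return ComparatorLoader::compareUnsigned;
        }
    }

    // plain Java unsigned, lexicographic byte comparison (the fallback)
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    }
}
```

The point of the design is that the build and the runtime never hard-depend on sun.misc.Unsafe: only the optional class references it, and that class is loaded reflectively.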
[jira] [Commented] (LUCENE-3631) Remove write access from SegmentReader and possibly move to separate class or IndexWriter/BufferedDeletes/...
[ https://issues.apache.org/jira/browse/LUCENE-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173959#comment-13173959 ]

Uwe Schindler commented on LUCENE-3631:
---------------------------------------

Hi, I committed some small cleanups and dead code removal after Clover analysis this morning.

One thing: we have thread locals for TermVectorsReader and StoredFieldsReader. Would it make sense to use one for DocValues, too? What do you think, Simon?

> Remove write access from SegmentReader and possibly move to separate class or IndexWriter/BufferedDeletes/...
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3631
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3631
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: core/index
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Michael McCandless
>         Attachments: LUCENE-3631.patch, LUCENE-3631.patch
>
> After LUCENE-3606 is finished, there are some TODOs: SegmentReader still contains (package-private) all delete logic including crazy copyOnWrite for validDocs Bits. It would be good if SegmentReader itself could be read-only like all other IndexReaders. There are two possibilities to do this:
> # The simple one: subclass SegmentReader and make a RWSegmentReader that is only used by IndexWriter/BufferedDeletes/... DirectoryReader would only use the read-only SegmentReader. This would move all TODOs to a separate class. Its reopen/clone method would always create a RO-SegmentReader (for NRT).
> # Remove all write and commit stuff from SegmentReader completely and move it to IndexWriter's readerPool (it must be in readerPool, as deletions need a non-changing view on an index snapshot).
> Unfortunately the code is so complicated, and I have no real experience in those internals of IndexWriter, so I did not want to do it with LUCENE-3606; I just separated the code in SegmentReader and marked it with TODO. Maybe Mike McCandless can help :-)
[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
[ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173962#comment-13173962 ]

Robert Muir commented on LUCENE-3654:
-------------------------------------

The reason I am -1: I don't want JVM crashes. This is Lucene Java; users can expect not to have JVM crashes because of BytesRef bugs in Lucene (this class is used all over the place). They should get AIOOBE and NPE and other things. So all is not fine just because it has a fallback.

Convincing me that there is a performance win is a waste of time; this method is not a hotspot. Convincing me that nobody will get JVM crashes is going to be difficult.
[jira] [Commented] (LUCENE-3662) extend LevenshteinAutomata to support transpositions as primitive edits
[ https://issues.apache.org/jira/browse/LUCENE-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173965#comment-13173965 ]

Uwe Schindler commented on LUCENE-3662:
---------------------------------------

How many beers did you need for that?
[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
[ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173971#comment-13173971 ]

Robert Muir commented on LUCENE-3654:
-------------------------------------

Here's an example. Since so much of the Lucene codebase has bugs with BytesRef offsets, I figure it's a good example:

{noformat}
public void testOops() {
  BytesRef b = new BytesRef("abcdefghijklmnop");
  b.offset = -545454544; // some bug, integer overflows and goes negative, or other problem
  System.out.println(b.compareTo(new BytesRef("abcdefghijklmnop")));
}
{noformat}

With this patch, this gives me a SIGSEGV:

{noformat}
junit-sequential:
    [junit] Testsuite: org.apache.lucene.util.TestBytesRef
    [junit] #
    [junit] # A fatal error has been detected by the Java Runtime Environment:
    [junit] #
    [junit] #  SIGSEGV (0xb) at pc=0x7f386e7dcf64, pid=6093, tid=139880338200320
    [junit] #
    [junit] # JRE version: 6.0_24-b07
    [junit] # Java VM: Java HotSpot(TM) 64-Bit Server VM (19.1-b02 mixed mode linux-amd64 compressed oops)
    [junit] # Problematic frame:
    [junit] # V  [libjvm.so+0x76ef64]
    [junit] #
{noformat}
[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
[ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173972#comment-13173972 ]

Simon Willnauer commented on LUCENE-3654:
-----------------------------------------

Can we have some -DXX:LuceneUseUnsafe option to enable this? I mean, there are two camps here, and that could make everybody happy. If you use this option, you have to expect possible problems, no?
[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
[ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173973#comment-13173973 ]

Uwe Schindler commented on LUCENE-3654:
---------------------------------------

The SIGSEGV can be solved by doing some safety checks at the beginning of compare(): check that offset >= 0 and offset + length <= bytes.length. If you use Unsafe, you have to make sure that your parameters are 1000% correct, that's all. This is why java.nio does lots of checks in its Buffer methods.
[jira] [Commented] (LUCENE-3631) Remove write access from SegmentReader and possibly move to separate class or IndexWriter/BufferedDeletes/...
[ https://issues.apache.org/jira/browse/LUCENE-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173975#comment-13173975 ]

Simon Willnauer commented on LUCENE-3631:
-----------------------------------------

bq. One thing: we have thread locals for TermVectorsReader and StoredFieldsReader. Would it make sense to use one for DocValues, too? What do you think Simon?

I don't see a need for this. The source is cached in the DocValues instance, and DocValues instances can be shared across threads.
[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
[ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173976#comment-13173976 ]

Uwe Schindler commented on LUCENE-3654:
---------------------------------------

bq. can we have some -DXX:LuceneUseUnsafe option to enable this. I mean there are two camps here and that could make everybody happy? I mean if you use this option you have to expect possible problems no?

We could put the whole comparator into contrib, and BytesRef could have a static setter to change the default impl. Or we use SPI for it (contrib exports it in META-INF) :-)
[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
[ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173977#comment-13173977 ]

Robert Muir commented on LUCENE-3654:
-------------------------------------

Sorry, I'm totally against the change, even with safety checks. I think this will hurt the reputation of the project, and I think it will be a nightmare for developers too (sorry, I don't want to debug avoidable JVM crashes). And I don't want to see Lucene start using Unsafe everywhere. This is lucene-java; things like bounds checking are part of the language.
[JENKINS] Solr-trunk - Build # 1711 - Still Failing
Build: https://builds.apache.org/job/Solr-trunk/1711/

No tests ran.

Build Log (for compile errors):
[...truncated 37341 lines...]
[jira] [Commented] (LUCENE-3631) Remove write access from SegmentReader and possibly move to separate class or IndexWriter/BufferedDeletes/...
[ https://issues.apache.org/jira/browse/LUCENE-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173983#comment-13173983 ]

Uwe Schindler commented on LUCENE-3631:
---------------------------------------

bq. The source is cached in the DocValues instance and DocValues instances can be shared across thread.

Thanks, I just wanted to make sure that there is no synchronization on DocValues. A customer of mine saw huge improvements in loading stored fields since this went into Lucene.
[jira] [Issue Comment Edited] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
[ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173973#comment-13173973 ]

Uwe Schindler edited comment on LUCENE-3654 at 12/21/11 10:09 AM:
------------------------------------------------------------------

The SIGSEGV can be solved by doing some safety checks at the beginning of compare(): check that offset >= 0 and offset + length <= bytes.length. If you use Unsafe, you have to make sure that your parameters are 1000% correct, that's all. This is why java.nio does lots of checks in its Buffer methods.

*EDIT* You also have to copy offset, length, and the actual byte[] reference to local variables at the beginning, before the bounds checks (because otherwise another thread could change the *public* non-final fields in BytesRef and cause a SIGSEGV). BytesRef is a user-visible class, so it must be 100% safe against all usage violations. Given this additional overhead, the whole comparator makes no sense except for terms larger than about 200 bytes, and Lucene terms are shorter in 99% of all cases. If you want to use this comparator, just subclass Lucene40Codec and return it as the term comparator; this can live completely outside Lucene. You can even use Guava.
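[Editor's note] The snapshot-then-validate pattern Uwe describes can be sketched as follows. All names are hypothetical (a stand-in struct, not Lucene's real BytesRef): copy the mutable public fields to locals once, bounds-check the locals, and only then access the array, so a racing writer cannot invalidate the checks between validation and access.

```java
public class SafeCompareSketch {
    // stand-in for BytesRef: public, non-final, mutable fields
    public static class BytesRefLike {
        public byte[] bytes;
        public int offset, length;
        public BytesRefLike(byte[] b, int off, int len) { bytes = b; offset = off; length = len; }
    }

    public static int compare(BytesRefLike a, BytesRefLike b) {
        // 1) snapshot the mutable fields exactly once
        final byte[] aB = a.bytes; final int aOff = a.offset, aLen = a.length;
        final byte[] bB = b.bytes; final int bOff = b.offset, bLen = b.length;
        // 2) validate the snapshot, not the (possibly still-changing) fields
        checkBounds(aB, aOff, aLen);
        checkBounds(bB, bOff, bLen);
        // 3) now the access is provably in bounds: unsigned, lexicographic compare
        int n = Math.min(aLen, bLen);
        for (int i = 0; i < n; i++) {
            int cmp = (aB[aOff + i] & 0xFF) - (bB[bOff + i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return aLen - bLen;
    }

    static void checkBounds(byte[] bytes, int off, int len) {
        // written as "off > bytes.length - len" to avoid int overflow in off + len
        if (off < 0 || len < 0 || off > bytes.length - len) {
            throw new IndexOutOfBoundsException("offset=" + off + " length=" + len);
        }
    }
}
```

With this shape, the corrupted-offset example from earlier in the thread fails with IndexOutOfBoundsException instead of a JVM crash.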
[jira] [Created] (SOLR-2983) Unable to load custom MergePolicy
Unable to load custom MergePolicy - Key: SOLR-2983 URL: https://issues.apache.org/jira/browse/SOLR-2983 Project: Solr Issue Type: Bug Reporter: Mathias Herberts As part of a recent upgrade to Solr 3.5.0 we encountered an error related to our use of LinkedIn's ZoieMergePolicy. It seems the code that loads a custom MergePolicy was at some point moved into SolrIndexConfig.java from SolrIndexWriter.java, but as this code was copied verbatim it now contains a bug:

try {
  policy = (MergePolicy) schema.getResourceLoader().newInstance(mpClassName, null, new Class[]{IndexWriter.class}, new Object[]{this});
} catch (Exception e) {
  policy = (MergePolicy) schema.getResourceLoader().newInstance(mpClassName);
}

'this' is no longer an IndexWriter but a SolrIndexConfig, therefore the call to newInstance will always throw an exception and the catch clause will be executed. If the custom MergePolicy does not have a default constructor (which is the case for ZoieMergePolicy), the second attempt to create the MergePolicy will also fail and Solr won't start.
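The failure mode is easy to reproduce with plain reflection. In this sketch, Writer, Config, and CustomPolicy are hypothetical stand-ins for IndexWriter, SolrIndexConfig, and ZoieMergePolicy; passing the wrong `this` makes the first attempt throw, and the missing default constructor makes the fallback throw too:

```java
import java.lang.reflect.Constructor;

final class PolicyLoaderDemo {
    static class Writer {}   // stand-in for IndexWriter
    static class Config {}   // stand-in for SolrIndexConfig

    /** Stand-in for ZoieMergePolicy: one-arg constructor, no default one. */
    static class CustomPolicy {
        final Writer writer;
        CustomPolicy(Writer w) { this.writer = w; }
    }

    /** Mirrors the loading logic quoted above: try the IndexWriter-arg
     *  constructor first, fall back to the no-arg constructor on failure. */
    static Object newPolicy(Class<?> clazz, Object ctorArg) throws Exception {
        try {
            Constructor<?> c = clazz.getDeclaredConstructor(Writer.class);
            c.setAccessible(true);
            return c.newInstance(ctorArg); // throws if ctorArg is not a Writer
        } catch (Exception e) {
            // Fallback path: throws NoSuchMethodException for policies
            // like CustomPolicy that have no default constructor.
            return clazz.getDeclaredConstructor().newInstance();
        }
    }
}
```

With a real Writer the first attempt succeeds; with a Config (the bug) both attempts fail and the load blows up, matching the reported startup failure.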
[jira] [Updated] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santiago M. Mola updated LUCENE-3663: - Attachment: PhoneFilter.java This is a proof-of-concept TokenFilter that does the job using Google's libphonenumber (https://code.google.com/p/libphonenumber/). Each token is converted to a phone number in international format, using a default country for guessing the country code if needed. If the token is not a valid phone number, it's filtered out. Add a phone number normalization TokenFilter Key: LUCENE-3663 URL: https://issues.apache.org/jira/browse/LUCENE-3663 Project: Lucene - Java Issue Type: New Feature Components: modules/analysis Reporter: Santiago M. Mola Priority: Minor Attachments: PhoneFilter.java Phone numbers can be found in the wild in an infinite variety of formats (e.g. with spaces, parentheses, dashes, with or without country code, with letters in substitution of numbers). So some Lucene applications can benefit from phone normalization with a TokenFilter that gets a phone number in any format, and outputs it in a standard format, using a default country to guess the country code if it's not present.
[jira] [Commented] (LUCENE-3660) If indexwriter hits a non-ioexception from indexExists it leaks a write.lock
[ https://issues.apache.org/jira/browse/LUCENE-3660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173997#comment-13173997 ] Michael McCandless commented on LUCENE-3660: +1, good catch! If indexwriter hits a non-ioexception from indexExists it leaks a write.lock Key: LUCENE-3660 URL: https://issues.apache.org/jira/browse/LUCENE-3660 Project: Lucene - Java Issue Type: Bug Reporter: Robert Muir Attachments: LUCENE-3660.patch the rest of IW's ctor is careful about this. IndexReader.indexExists catches any IOException and returns false, but the problem occurs if some other exception is thrown (in my test, UnsupportedOperationException, but you can imagine others are possible) when trying to e.g. read in the segments file. I think we just need to move the IR.exists stuff inside the try / finally
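Mike's +1 is to the try/finally fix Robert describes. A minimal, self-contained sketch of that pattern follows; a plain ReentrantLock stands in for the directory's write.lock, and indexExists() throws an unchecked exception as in Robert's test. All names here are illustrative, not Lucene's actual code.

```java
import java.util.concurrent.locks.ReentrantLock;

final class WriteLockSafety {

    static boolean indexExists() {
        // Simulates e.g. a directory impl that can't read the segments file:
        // not an IOException, so a plain catch (IOException) won't see it.
        throw new UnsupportedOperationException();
    }

    /** Success-flag try/finally: the lock is released on *any* Throwable,
     *  not just IOException, but kept when construction succeeds. */
    static boolean openSafely(ReentrantLock writeLock) {
        writeLock.lock();
        boolean success = false;
        try {
            boolean exists = indexExists(); // moved inside the guarded region
            success = true;
            return exists;
        } finally {
            if (!success) {
                writeLock.unlock(); // never leak write.lock on failure
            }
        }
    }
}
```

The point of the success flag is that the happy path keeps the lock (as IndexWriter must), while every failure path, checked or unchecked, releases it.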
[jira] [Commented] (LUCENE-3605) revisit segments.gen sleeping
[ https://issues.apache.org/jira/browse/LUCENE-3605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173995#comment-13173995 ] Michael McCandless commented on LUCENE-3605: Woops -- I'll nuke the getter! revisit segments.gen sleeping - Key: LUCENE-3605 URL: https://issues.apache.org/jira/browse/LUCENE-3605 Project: Lucene - Java Issue Type: Improvement Reporter: Robert Muir Assignee: Michael McCandless Attachments: LUCENE-3605.patch in LUCENE-3601, I worked up a change where we intentionally crash() all un-fsynced files in tests to ensure that we are calling sync on files when we should. I think this would be nice to do always (and with some fixes all tests pass). But this is super-slow sometimes because when we corrupt the unsynced segments.gen, it causes SIS.read to take 500ms each time (and in checkindex for some reason we do this twice, which seems wrong). I can work around this for now for tests (just do a partial crash that avoids corrupting the segments.gen), but I wanted to create this issue for discussion about the sleeping/non-fsyncing of segments.gen, just because I guess it's possible someone could hit this slowness.
[jira] [Commented] (LUCENE-3661) move deletes under codec
[ https://issues.apache.org/jira/browse/LUCENE-3661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174000#comment-13174000 ] Michael McCandless commented on LUCENE-3661: This sounds like a great plan! So then the use of BitVector is an impl detail to the codec... move deletes under codec Key: LUCENE-3661 URL: https://issues.apache.org/jira/browse/LUCENE-3661 Project: Lucene - Java Issue Type: Task Affects Versions: 4.0 Reporter: Robert Muir After LUCENE-3631, this should be easier I think. I haven't looked at it much myself but i'll play around a bit, but at a glance:
* SegmentReader to have Bits liveDocs instead of BitVector
* address the TODO in the IW-using ctors so that SegmentReader doesn't take a parent but just an existing core.
* we need some type of minimal MutableBits or similar subinterface of bits. BitVector and maybe Fixed/OpenBitSet could implement it
* BitVector becomes an impl detail and moves to codec (maybe we have a shared base class and split the 3.x/4.x up rather than the conditional backwards)
* I think the invertAll should not be used by IndexWriter, instead we define the codec interface to say give me a new MutableBits, by default all are set ?
* redundant internally-consistent checks in checkLiveCounts should be done in the codec impl instead of in SegmentReader.
* plain text impl in SimpleText.
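The MutableBits idea from the list above can be sketched in a few lines. This is illustrative only: the interface names follow the issue's proposal, and a java.util.BitSet stands in for BitVector/Fixed/OpenBitSet:

```java
import java.util.BitSet;

final class MutableBitsDemo {

    /** Read-only view, as SegmentReader would expose for liveDocs. */
    interface Bits {
        boolean get(int index);
        int length();
    }

    /** Minimal mutable subinterface: just enough for recording deletes. */
    interface MutableBits extends Bits {
        void clear(int index);
    }

    /** BitSet-backed impl standing in for BitVector; starts all-set, per the
     *  proposed codec contract "give me a new MutableBits, by default all are set". */
    static class LiveDocs implements MutableBits {
        private final BitSet bits;
        private final int len;

        LiveDocs(int len) {
            this.len = len;
            this.bits = new BitSet(len);
            bits.set(0, len); // all documents live initially
        }

        @Override public boolean get(int index) { return bits.get(index); }
        @Override public int length() { return len; }
        @Override public void clear(int index) { bits.clear(index); } // delete a doc
    }
}
```

Handing out an all-set instance from the codec (rather than having IndexWriter call invertAll) is exactly the inversion the fifth bullet proposes.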
[jira] [Updated] (LUCENE-3631) Remove write access from SegmentReader and possibly move to separate class or IndexWriter/BufferedDeletes/...
[ https://issues.apache.org/jira/browse/LUCENE-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3631: -- Attachment: LUCENE-3631-threadlocals.patch This patch also moves the threadlocals to SegmentCoreReaders, as they can be reused on reopen/nrt readers. Also improves ensureOpen() checks to guard everything without duplicating checks. Remove write access from SegmentReader and possibly move to separate class or IndexWriter/BufferedDeletes/... - Key: LUCENE-3631 URL: https://issues.apache.org/jira/browse/LUCENE-3631 Project: Lucene - Java Issue Type: Task Components: core/index Affects Versions: 4.0 Reporter: Uwe Schindler Assignee: Michael McCandless Attachments: LUCENE-3631-threadlocals.patch, LUCENE-3631.patch, LUCENE-3631.patch After LUCENE-3606 is finished, there are some TODOs: SegmentReader still contains (package-private) all delete logic including crazy copyOnWrite for validDocs Bits. It would be good, if SegmentReader itself could be read-only like all other IndexReaders. There are two possibilities to do this:
# the simple one: Subclass SegmentReader and make a RWSegmentReader that is only used by IndexWriter/BufferedDeletes/... DirectoryReader will only use the read-only SegmentReader. This would move all TODOs to a separate class. Its reopen/clone method would always create a RO-SegmentReader (for NRT).
# Remove all write and commit stuff from SegmentReader completely and move it to IndexWriter's readerPool (it must be in readerPool as deletions need a not-changing view on an index snapshot).
Unfortunately the code is so complicated and I have no real experience in those internals of IndexWriter so I did not want to do it with LUCENE-3606, I just separated the code in SegmentReader and marked with TODO. Maybe Mike McCandless can help :-)
[jira] [Commented] (LUCENE-3631) Remove write access from SegmentReader and possibly move to separate class or IndexWriter/BufferedDeletes/...
[ https://issues.apache.org/jira/browse/LUCENE-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174012#comment-13174012 ] Uwe Schindler commented on LUCENE-3631: --- Heavy committed at revision: 1221677
Re: [JENKINS] Lucene-Solr-tests-only-trunk - Build # 11853 - Failure
I can't reproduce this one... Mike McCandless http://blog.mikemccandless.com On Wed, Dec 21, 2011 at 3:45 AM, Apache Jenkins Server jenk...@builds.apache.org wrote: Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/11853/ 1 tests failed. REGRESSION: org.apache.lucene.index.TestIndexWriter.testThreadInterruptDeadlock Error Message: null Stack Trace: junit.framework.AssertionFailedError: at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165) at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57) at org.apache.lucene.index.TestIndexWriter.testThreadInterruptDeadlock(TestIndexWriter.java:1270) at org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:528) Build Log (for compile errors): [...truncated 1335 lines...]
[jira] [Commented] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174022#comment-13174022 ] Uwe Schindler commented on LUCENE-3663: --- This looks strange and creates useless objects:
{code:java}
final char[] buffer = termAtt.buffer();
final int length = termAtt.length();
CharBuffer cb = CharBuffer.wrap(buffer, 0, length);
try {
  PhoneNumber pn = pnu.parse(cb.toString(), defaultCountry);
{code}
should be:
{code:java}
try {
  PhoneNumber pn = pnu.parse(termAtt.toString(), defaultCountry);
{code}
Ideally, PhoneNumberUtil would take CharSequence, but unfortunately Google's lib is too stupid to use a more generic Java type. Otherwise patch looks fine, but it adds another external reference. You should make all fields final, they will never change!
[jira] [Issue Comment Edited] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174022#comment-13174022 ] Uwe Schindler edited comment on LUCENE-3663 at 12/21/11 11:34 AM: -- This looks strange and creates useless objects:
{code:java}
final char[] buffer = termAtt.buffer();
final int length = termAtt.length();
CharBuffer cb = CharBuffer.wrap(buffer, 0, length);
try {
  PhoneNumber pn = pnu.parse(cb.toString(), defaultCountry);
{code}
should be:
{code:java}
try {
  PhoneNumber pn = pnu.parse(termAtt.toString(), defaultCountry);
{code}
Ideally, PhoneNumberUtil would take CharSequence (so you could directly pass termAtt without toString()), but unfortunately Google's lib is too stupid to use a more generic Java type. Otherwise patch looks fine, but it adds another external library. You should make all fields final, they will never change! was (Author: thetaphi): This looks strange and creates useless objects:
{code:java}
final char[] buffer = termAtt.buffer();
final int length = termAtt.length();
CharBuffer cb = CharBuffer.wrap(buffer, 0, length);
try {
  PhoneNumber pn = pnu.parse(cb.toString(), defaultCountry);
{code}
should be:
{code:java}
try {
  PhoneNumber pn = pnu.parse(termAtt.toString(), defaultCountry);
{code}
Ideally, PhoneNumberUtil would take CharSequence (so you could directly pass termAtt without toString()), but unfortunately Google's lib is too stupid to use a more generic Java type. Otherwise patch looks fine, but it adds another external reference. You should make all fields final, they will never change!
[jira] [Issue Comment Edited] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174022#comment-13174022 ] Uwe Schindler edited comment on LUCENE-3663 at 12/21/11 11:33 AM: -- This looks strange and creates useless objects:
{code:java}
final char[] buffer = termAtt.buffer();
final int length = termAtt.length();
CharBuffer cb = CharBuffer.wrap(buffer, 0, length);
try {
  PhoneNumber pn = pnu.parse(cb.toString(), defaultCountry);
{code}
should be:
{code:java}
try {
  PhoneNumber pn = pnu.parse(termAtt.toString(), defaultCountry);
{code}
Ideally, PhoneNumberUtil would take CharSequence (so you could directly pass termAtt without toString()), but unfortunately Google's lib is too stupid to use a more generic Java type. Otherwise patch looks fine, but it adds another external reference. You should make all fields final, they will never change! was (Author: thetaphi): This looks strange and creates useless objects:
{code:java}
final char[] buffer = termAtt.buffer();
final int length = termAtt.length();
CharBuffer cb = CharBuffer.wrap(buffer, 0, length);
try {
  PhoneNumber pn = pnu.parse(cb.toString(), defaultCountry);
{code}
should be:
{code:java}
try {
  PhoneNumber pn = pnu.parse(termAtt.toString(), defaultCountry);
{code}
Ideally, PhoneNumberUtil would take CharSequence, but unfortunately Google's lib is too stupid to use a more generic Java type. Otherwise patch looks fine, but it adds another external reference. You should make all fields final, they will never change!
[jira] [Commented] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174024#comment-13174024 ] Uwe Schindler commented on LUCENE-3663: --- One more thing, as you want to filter out tokens, you should not subclass TokenFilter directly but instead subclass org.apache.lucene.analysis.util.FilteringTokenFilter and do the work in the match() method. You are free to modify the token there, too. This new base class would correctly handle position increments, as noted as TODO in your comments.
[jira] [Issue Comment Edited] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174024#comment-13174024 ] Uwe Schindler edited comment on LUCENE-3663 at 12/21/11 11:39 AM: -- One more thing, as you want to filter out tokens, you should not subclass TokenFilter directly but instead subclass org.apache.lucene.analysis.util.FilteringTokenFilter and do the work in the accept() method. You are free to modify the token there, too. This new base class would correctly handle position increments, as noted as TODO in your comments. was (Author: thetaphi): One more thing, as you want to filter out tokens, you should not subclass TokenFilter directly but instead subclass org.apache.lucene.analysis.util.FilteringTokenFilter and do the work in the match() method. You are free to modify the token there, too. This new base class would correctly handle position increments, as noted as TODO in your comments.
[jira] [Commented] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174025#comment-13174025 ] Michael McCandless commented on LUCENE-3663: +1 I think this would be a useful addition.
[jira] [Commented] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174026#comment-13174026 ] Robert Muir commented on LUCENE-3663: - I think actually that we should not remove tokens that aren't phone numbers. Sometimes there just might be other things instead of phone numbers, or maybe the phone number detection/normalization is just imperfect, so it's better not to throw tokens away; instead just no normalization happens, like a stemmer. In general we can also assume the text is unstructured and might have other stuff (this implies someone has a super-cool tokenizer that doesn't split up any dirty phone numbers, but we just leave the possibility). Then I think the while loop could be removed: if the phone number normalization succeeds, mark the type as phone. Otherwise, in the exception case, output it unchanged. Then non-phone-numbers or whatever can be easily filtered out separately with a subclass of FilteringTokenFilter.
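Robert's "normalize on success, pass through otherwise" control flow can be sketched with a toy normalizer. Note this does not use libphonenumber: normalize() and its 7-15 digit heuristic are invented for illustration only; real code would call PhoneNumberUtil.parse and format the result in international format.

```java
final class PhoneNormalizeDemo {

    /** Return the token in "+&lt;cc&gt;&lt;digits&gt;" form when it looks like a phone
     *  number, otherwise return it unchanged (pass-through, like a stemmer). */
    static String normalize(String token, String defaultCountryCode) {
        boolean international = token.startsWith("+");
        // Strip common separators: spaces, parentheses, dots, dashes.
        String digits = token.replaceAll("[\\s().\\-]", "");
        if (international) {
            return digits; // already carries a country code
        }
        if (!digits.matches("\\d{7,15}")) {
            return token; // not a phone number: leave the token untouched
        }
        return "+" + defaultCountryCode + digits; // guess from default country
    }
}
```

In the actual TokenFilter, the success case would additionally set the token type to e.g. "phone", so a separate FilteringTokenFilter subclass can later drop non-phone tokens if desired.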
[jira] [Commented] (SOLR-2980) DataImportHandler becomes unresponsive with Microsoft JDBC driver
[ https://issues.apache.org/jira/browse/SOLR-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174042#comment-13174042 ] Steve Wolfe commented on SOLR-2980: --- After careful comparison of the working vs. non-working machines I identified that the non-working machines were using a slightly newer build of the JRE (both were using 1.6.0_20, but two different builds of that same runtime). By explicitly installing the older version all issues went away. During diagnostics I had also found that the issue was not specific to Solr, but rather appeared to be between the affected JRE and the SQL Server JDBC driver. Good build of the JRE: http://pkgs.org/centos-5-rhel-5/centos-rhel-x86_64/java-1.6.0-openjdk-1.6.0.0-1.22.1.9.8.el5_6.x86_64.rpm.html Bad build of the JRE: http://pkgs.org/centos-5-rhel-5/centos-rhel-updates-x86_64/java-1.6.0-openjdk-src-1.6.0.0-1.23.1.9.10.el5_7.x86_64.rpm.html DataImportHandler becomes unresponsive with Microsoft JDBC driver - Key: SOLR-2980 URL: https://issues.apache.org/jira/browse/SOLR-2980 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Affects Versions: 3.4, 3.5 Environment: Java JRE 1.6.0_20, JRE 1.6.0_29, CentOS (kernel 2.6.18-274.3.1.e15), Microsoft SQL Server JDBC Driver 3.0 Reporter: Steve Wolfe Labels: dataimport, jdbc, solr, sql, sqlserver A solr core has been configured to use the DataImportHandler to read a set of documents from a Microsoft SQL Server database, via the Microsoft JDBC driver. A known-good configuration for the data import handler is used, and a reload-config followed by full-import command are issued to the DataImportHandler. The handler switches to a status of A command is still running..., and shows 1 request has been made to the data source. Subsequent status calls show the Time Elapsed growing, but the handler fails to perform any action--SQL Server confirms that a login event occurs, but no queries are issued. 
Solr does not throw any exceptions, even after a very long duration. The last message in Solr's output is INFO: Creating a connection for entity {entity name} with URL: {entity datasource url} Attempts to issue an Abort command to the DataImportHandler appear successful, but do not stop the operation. Running the solr instance with the java -verbose flag shows the following: *IMMEDIATELY UPON EXECUTING FULL-IMPORT COMMAND*
[Loaded com.microsoft.sqlserver.jdbc.StreamPacket from file:/home/MYWEBGROCER/swolfe/downloads/apache-solr-3.5.0/example/lib/sqljdbc4.jar]
[Loaded com.microsoft.sqlserver.jdbc.StreamLoginAck from file:/home/MYWEBGROCER/swolfe/downloads/apache-solr-3.5.0/example/lib/sqljdbc4.jar]
[Loaded com.microsoft.sqlserver.jdbc.StreamDone from file:/home/MYWEBGROCER/swolfe/downloads/apache-solr-3.5.0/example/lib/sqljdbc4.jar]
*APPROXIMATELY 40 SECONDS LATER*
[Loaded java.io.InterruptedIOException from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/jre/lib/rt.jar]
[Loaded java.net.SocketTimeoutException from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/jre/lib/rt.jar]
[Loaded sun.net.ConnectionResetException from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/jre/lib/rt.jar]
An issue with identical symptoms has been reported on StackOverflow (the OP found that using a 3rd party JDBC driver appeared successful): http://stackoverflow.com/questions/8269038/solr-dataimporthandler-logs-into-sql-but-never-fetches-any-data
[jira] [Closed] (SOLR-2980) DataImportHandler becomes unresponsive with Microsoft JDBC driver
[ https://issues.apache.org/jira/browse/SOLR-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Wolfe closed SOLR-2980. - Resolution: Not A Problem Determined that the issue is not Solr-specific, but rather it occurs between affected versions/builds of the JRE and the MS SQL JDBC driver. See comment for details.
[jira] [Commented] (SOLR-2980) DataImportHandler becomes unresponsive with Microsoft JDBC driver
[ https://issues.apache.org/jira/browse/SOLR-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174044#comment-13174044 ] Uwe Schindler commented on SOLR-2980: - See my comments about the mess with OpenJDK version numbers; you cannot read anything out of them. My advice: don't use OpenJDK, download the real Oracle JDKs, please! http://blog.thetaphi.de/2011/12/jdk-7u2-released-how-about-linux-and.html
[jira] [Commented] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174046#comment-13174046 ] Santiago M. Mola commented on LUCENE-3663: -- @Uwe: Thanks for the comments. @Robert: So this filter would mark phone tokens with the PHONE type, and I could filter non-PHONE tokens with a subsequent filter? In my specific use case, I need to throw away any token that could not be normalized, so I have to, at least, mark phone tokens for removal in further steps. If tokens are not marked, we would have to check twice whether a token is a valid phone number.
Add a phone number normalization TokenFilter Key: LUCENE-3663 URL: https://issues.apache.org/jira/browse/LUCENE-3663 Project: Lucene - Java Issue Type: New Feature Components: modules/analysis Reporter: Santiago M. Mola Priority: Minor Attachments: PhoneFilter.java
Phone numbers can be found in the wild in an infinite variety of formats (e.g. with spaces, parentheses, dashes, with or without country code, with letters substituted for digits). Some Lucene applications could therefore benefit from phone normalization: a TokenFilter that takes a phone number in any format and outputs it in a standard format, using a default country to guess the country code when it is not present.
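The attached PhoneFilter builds on libphonenumber; as a rough stand-alone illustration of the normalization idea only (this is not the attachment's code, and the class name is made up for the example), a plain-Java sketch that drops punctuation and maps ITU keypad letters to digits:

```java
import java.util.Locale;

public class PhoneNormalizer {
    // Standard phone keypad mapping: ABC->2, DEF->3, ..., WXYZ->9 (indexed by letter A..Z).
    private static final String KEYPAD = "22233344455566677778889999";

    /** Strips punctuation and maps keypad letters to digits; returns null if nothing usable remains. */
    public static String normalize(String raw) {
        StringBuilder digits = new StringBuilder();
        for (char c : raw.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c >= '0' && c <= '9') {
                digits.append(c);
            } else if (c >= 'A' && c <= 'Z') {
                digits.append(KEYPAD.charAt(c - 'A'));
            } else if (c == '+' && digits.length() == 0) {
                digits.append(c); // keep a leading + marking an explicit country code
            }
            // spaces, dashes, dots, and parentheses are simply dropped
        }
        return digits.length() == 0 ? null : digits.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("(800) FLOWERS"));   // prints 8003569377
        System.out.println(normalize("+1 212-555-0123")); // prints +12125550123
    }
}
```

A real filter would instead hand the cleaned string to libphonenumber for country-code resolution and E.164 formatting, which this sketch does not attempt.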
[jira] [Commented] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174058#comment-13174058 ] Santiago M. Mola commented on LUCENE-3663: -- Bug report for libphonenumber in order to get it to support CharSequence: https://code.google.com/p/libphonenumber/issues/detail?id=84
[jira] [Commented] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174074#comment-13174074 ] Robert Muir commented on LUCENE-3663: - Santiago, yeah, I think if normalization is successful you would change the type to PHONE, since the token was recognized as one. Otherwise, when you get the exception, just 'return true' and leave all attributes unchanged. In the successful case, besides setting the type, you could even keep the PhoneNumber object rather than throwing it away, and put it in an attribute. That way, if someone wants to do more complicated stuff, the attributes are at least available; it's also useful for things like Solr's analysis.jsp, just for debugging how the analysis worked.
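Robert's suggested behavior (rewrite and re-type on success, pass through untouched on failure) can be sketched outside Lucene's TokenStream API with a simplified token model; the Token class, tag() method, and the toy normalize() below are illustrative stand-ins, not Lucene's or libphonenumber's API:

```java
public class PhoneTypeTagger {
    public static final String PHONE = "PHONE";

    /** Simplified stand-in for a Lucene token: mutable text plus a type attribute. */
    public static class Token {
        public String text;
        public String type = "word";
        public Token(String text) { this.text = text; }
    }

    /**
     * Mimics the suggested incrementToken() logic: try to normalize; on success,
     * rewrite the term and set the type to PHONE; on failure, return the token
     * with all attributes unchanged (the analogue of 'return true').
     */
    public static Token tag(Token token) {
        try {
            token.text = normalize(token.text); // stand-in for libphonenumber parsing
            token.type = PHONE;
        } catch (IllegalArgumentException notAPhone) {
            // leave the token untouched; a later FilteringTokenFilter can drop non-PHONE types
        }
        return token;
    }

    private static String normalize(String raw) {
        String digits = raw.replaceAll("[^0-9]", "");
        if (digits.length() < 7) throw new IllegalArgumentException("not a phone number: " + raw);
        return digits;
    }
}
```

Failed tokens keep their original type, so downstream filters (or analysis.jsp) can still see exactly what the tagger decided.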
[jira] [Commented] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174078#comment-13174078 ] Uwe Schindler commented on LUCENE-3663: --- bq. Then this filter would mark phone tokens as PHONE type and I could filter non-PHONE tokens with a subsequent filter? YES! The FilteringTokenFilter subclass you would then add after this filter would simply have this accept() method:
{code:java}
@Override
protected boolean accept() {
  return PHONE.equals(typeAtt.type());
}
{code}
FilteringTokenFilter would then also handle position increments correctly, which your filter does not.
[jira] [Assigned] (SOLR-2242) Get distinct count of names for a facet field
[ https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erick Erickson reassigned SOLR-2242: Assignee: Erick Erickson (was: Simon Willnauer)
Get distinct count of names for a facet field - Key: SOLR-2242 URL: https://issues.apache.org/jira/browse/SOLR-2242 Project: Solr Issue Type: New Feature Components: Response Writers Affects Versions: 4.0 Reporter: Bill Bell Assignee: Erick Erickson Priority: Minor Fix For: 4.0 Attachments: NumFacetTermsFacetsTest.java, SOLR-2242-notworkingtest.patch, SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.shard.patch, SOLR-2242.shard.patch, SOLR-2242.shard.withtests.patch, SOLR-2242.solr3.1-fix.patch, SOLR-2242.solr3.1.patch, SOLR.2242.solr3.1.patch, SOLR.2242.v2.patch
When returning facet.field={name of field} you will get a list of matches for distinct values. This is normal behavior. This patch tells you how many distinct values you have (# of rows). Use with limit=-1 and mincount=1. The feature is called namedistinct. Here is an example:
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=2&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=0&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=1&facet.limit=-1&facet.field=price
This currently only works on facet.field.
{code}
<lst name="facet_fields">
  <lst name="price">
    <int name="numFacetTerms">14</int>
    <int name="0.0">3</int><int name="11.5">1</int><int name="19.95">1</int><int name="74.99">1</int><int name="92.0">1</int><int name="179.99">1</int><int name="185.0">1</int><int name="279.95">1</int><int name="329.95">1</int><int name="350.0">1</int><int name="399.0">1</int><int name="479.95">1</int><int name="649.99">1</int><int name="2199.0">1</int>
  </lst>
</lst>
{code}
Several people use this to get the group.field count (the # of groups).
[jira] [Commented] (SOLR-2242) Get distinct count of names for a facet field
[ https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174152#comment-13174152 ] Erick Erickson commented on SOLR-2242: -- OK, it seems like we have several themes here. I'd like to get a reasonable consensus before going forward... I'll put out a straw-man proposal here and we can go from there. But let's figure out where we're going before revamping stuff yet again.
1. Distributed support. I sure don't see a good way to support this currently. Perhaps some of the future enhancements will make this easier (thinking distributed TF/IDF and such, while being totally ignorant of that code), but returning the entire list of constraints (or names or terms or whatever we call it) is just a bad idea. The first time someone tries this on a field with 1,000,000 terms (yes, I've seen this) it'll just blow things up. I'm also slightly anti the min/max idea. I'm not sure what value there is in telling someone there are between 10,000 and 90,000 distinct values. And if it's a field with just a few pre-defined values, that information is already known anyway. But if someone can show a use-case here I'm not completely against it. I'd like to see the use case first, though, not just "someone might find it useful".
2. Back compat. Cody's suggestion seems to be the slickest in terms of not breaking things, but we use attributes in just a few places; are there reasons NOT to do it that way?
3. Possibly add a new JIRA for changing the facet response format to be tolerant of sub-fields, but don't do that here.
Again, I want a clearly defined end point for the concerns raised before we dive back in here.
[jira] [Issue Comment Edited] (SOLR-2242) Get distinct count of names for a facet field
[ https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174152#comment-13174152 ] Erick Erickson edited comment on SOLR-2242 at 12/21/11 3:45 PM: OK, it seems like we have several themes here. I'd like to get a reasonable consensus before going forward... I'll put out a straw-man proposal here and we can go from there. But let's figure out where we're going before revamping stuff yet again.
1. Distributed support. I sure don't see a good way to support this currently. Perhaps some of the future enhancements will make this easier (thinking distributed TF/IDF and such, while being totally ignorant of that code), but returning the entire list of constraints (or names or terms or whatever we call it) is just a bad idea. The first time someone tries this on a field with 1,000,000 terms (yes, I've seen this) it'll just blow things up. I'm also slightly anti the min/max idea. I'm not sure what value there is in telling someone there are between 10,000 and 90,000 distinct values. And if it's a field with just a few pre-defined values, that information is already known anyway. But if someone can show a use-case here I'm not completely against it. I'd like to see the use case first, though, not just "someone might find it useful".
2. Back compat. Cody's suggestion seems to be the slickest in terms of not breaking things, but we use attributes in just a few places; are there reasons NOT to do it that way? Or does this mess up JSON, PHP, etc.?
3. Possibly add a new JIRA for changing the facet response format to be tolerant of sub-fields, but don't do that here.
Again, I want a clearly defined end point for the concerns raised before we dive back in here.
[jira] [Commented] (SOLR-2804) Logging error causes entire DIH process to fail
[ https://issues.apache.org/jira/browse/SOLR-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174156#comment-13174156 ] Michael Haeusler commented on SOLR-2804: This problem also occurs with Solr 3.5.0. The stacktrace is almost identical:
{code}
Dec 20, 2011 11:22:36 AM org.apache.solr.common.SolrException log
SEVERE: Full Import failed:java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
	at org.apache.solr.common.util.NamedList.getName(NamedList.java:127)
	at org.apache.solr.common.util.NamedList.toString(NamedList.java:253)
	at java.lang.String.valueOf(String.java:2826)
	at java.lang.StringBuilder.append(StringBuilder.java:115)
	at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:188)
	at org.apache.solr.handler.dataimport.SolrWriter.finish(SolrWriter.java:133)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:213)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
{code}
Logging error causes entire DIH process to fail --- Key: SOLR-2804 URL: https://issues.apache.org/jira/browse/SOLR-2804 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Affects Versions: 4.0 Environment: java version 1.6.0_26, Java(TM) SE Runtime Environment (build 1.6.0_26-b03-384-10M3425), Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02-384, mixed mode); MacBook Pro (MacBookPro8,2), Intel Core i7, 2.2 GHz, 1 processor, 4 cores, L2 Cache (per core): 256 KB, L3 Cache: 6 MB, Memory: 4 GB; Mac OS X 10.6.8 (10K549), Kernel Version: Darwin 10.8.0 Reporter: Pulkit Singhal Labels: dih Original Estimate: 48h Remaining Estimate: 48h
{code}
SEVERE: Full Import failed:java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
	at org.apache.solr.common.util.NamedList.getName(NamedList.java:127)
	at org.apache.solr.common.util.NamedList.toString(NamedList.java:263)
	at java.lang.String.valueOf(String.java:2826)
	at java.lang.StringBuilder.append(StringBuilder.java:115)
	at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:188)
	at org.apache.solr.handler.dataimport.SolrWriter.close(SolrWriter.java:57)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:265)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:372)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:440)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:421)
{code}
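The ClassCastException comes from NamedList's storage scheme, which keeps names and values interleaved in a single flat list and casts the name slot to String in getName(). A simplified stand-alone model of that scheme (FlatNamedList here is illustrative, not Solr's actual class) shows how a value that lands in a name slot blows up during toString()-style iteration:

```java
import java.util.ArrayList;
import java.util.List;

public class FlatNamedList {
    // Names and values interleaved in one flat list: [name0, value0, name1, value1, ...]
    private final List<Object> nvPairs = new ArrayList<>();

    public void add(String name, Object value) {
        nvPairs.add(name);
        nvPairs.add(value);
    }

    /** Casts the name slot to String, like Solr's NamedList.getName(int). */
    public String getName(int idx) {
        return (String) nvPairs.get(idx << 1);
    }

    /** Simulates a buggy caller appending a bare value, which shifts later name slots. */
    public void addRaw(Object o) {
        nvPairs.add(o);
    }
}
```

Once the pairing is misaligned, getName() for a later entry retrieves a non-String (here an ArrayList) and the cast fails with exactly the exception seen in the logs.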
[jira] [Commented] (SOLR-2242) Get distinct count of names for a facet field
[ https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174179#comment-13174179 ] Jonathan Rochkind commented on SOLR-2242: - I would find this feature valuable even if it simply did not work at all on a distributed index. (Refusing to return a value, rather than returning a known-incorrect value, would seem like the right way to go.) Because my index is not distributed, and I would find this feature valuable, heh. I don't know if Solr currently has any policies against committing features that can't work distributed, but personally my 'vote' would be doing that here, with clear documentation that it doesn't work on distributed indexes (and the hope that future enhancements may make it more feasible to do so, as Erick suggests may possibly happen).
[jira] [Commented] (SOLR-2242) Get distinct count of names for a facet field
[ https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174184#comment-13174184 ] Yonik Seeley commented on SOLR-2242: bq. I'm also slightly anti the min/max idea. I'm not sure what value there is in telling someone there are between 10,000 and 90,000 distinct values. I think we could come up with a pretty good estimate (but we should tell them it's an estimate somehow). Anyway, that could optionally be handled in a different issue. bq. 2 back compat. Cody's suggestion seems to be the slickest in terms of not breaking things, but we use attributes in just a few places, are there reasons NOT to do it that way? Or does this mess up JSON, PHP, etc? Yes, it messes up JSON, binary format, etc. We'd need to figure out how to add attributes into our data model (that gets sent to response writers) in a generic way. bq. 3 Possibly add a new JIRA for changing the facet response format to be tolerant of sub-fields, but don't do that here. Not sure how that's possible... it's either more magic field names in with the individual constraints, or the facet response format has got to change. Get distinct count of names for a facet field - Key: SOLR-2242 URL: https://issues.apache.org/jira/browse/SOLR-2242 Project: Solr Issue Type: New Feature Components: Response Writers Affects Versions: 4.0 Reporter: Bill Bell Assignee: Erick Erickson Priority: Minor Fix For: 4.0 Attachments: NumFacetTermsFacetsTest.java, SOLR-2242-notworkingtest.patch, SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.shard.patch, SOLR-2242.shard.patch, SOLR-2242.shard.withtests.patch, SOLR-2242.solr3.1-fix.patch, SOLR-2242.solr3.1.patch, SOLR.2242.solr3.1.patch, SOLR.2242.v2.patch When returning facet.field=name of field you will get a list of matches for distinct values. This is normal behavior. This patch tells you how many distinct values you have (# of rows). Use with limit=-1 and mincount=1. The feature is called namedistinct. 
Here is an example:
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=2&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=0&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=1&facet.limit=-1&facet.field=price
This currently only works on facet.field.
{code}
<lst name="facet_fields">
  <lst name="price">
    <int name="numFacetTerms">14</int>
    <int name="0.0">3</int><int name="11.5">1</int><int name="19.95">1</int><int name="74.99">1</int><int name="92.0">1</int><int name="179.99">1</int><int name="185.0">1</int><int name="279.95">1</int><int name="329.95">1</int><int name="350.0">1</int><int name="399.0">1</int><int name="479.95">1</int><int name="649.99">1</int><int name="2199.0">1</int>
  </lst>
</lst>
{code}
Several people use this to get the group.field count (the # of groups). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
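For readers skimming the response above: numFacetTerms is simply the number of distinct value buckets returned alongside the usual per-value counts. A minimal plain-Java sketch of that relationship (illustrative only, not the patch's code; all names here are hypothetical):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FacetCountSketch {
    // Tally per-value counts; the distinct-value count ("numFacetTerms") is the map size.
    public static Map<String, Integer> facetCounts(List<String> fieldValues) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String v : fieldValues) {
            counts.merge(v, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = facetCounts(
            Arrays.asList("0.0", "0.0", "0.0", "11.5", "19.95"));
        System.out.println("numFacetTerms=" + counts.size()); // numFacetTerms=3
        System.out.println(counts);                           // {0.0=3, 11.5=1, 19.95=1}
    }
}
```

This mirrors the example output: 14 per-value entries, hence numFacetTerms=14 there.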
[jira] [Updated] (SOLR-2950) QueryElevationComponent needlessly looks up document ids
[ https://issues.apache.org/jira/browse/SOLR-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated SOLR-2950: --- Attachment: SOLR-2950.patch

OK, just had a chance to view the comparator part of this patch. Here's a patch that fixes:
- minor check-for-null for fields() and terms(), which can return null
- even though docsEnum returns something, it may be deleted (i.e. need to check for NO_MORE_DOCS)
- use liveDocs when requesting the docsEnum so we won't use a deleted (overwritten) doc

The last two issues would both cause us to miss elevated documents if they have been updated and an old deleted version still exists in the index.

QueryElevationComponent needlessly looks up document ids Key: SOLR-2950 URL: https://issues.apache.org/jira/browse/SOLR-2950 Project: Solr Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 4.0 Attachments: SOLR-2950.patch, SOLR-2950.patch, SOLR-2950.patch, SOLR-2950.patch The QueryElevationComponent needlessly instantiates a FieldCache and does look ups in it for every document. If we flipped things around a bit and got Lucene internal doc ids on inform() we could then simply do a much smaller and faster lookup during the sort. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
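The pitfalls listed in the comment above (the NO_MORE_DOCS sentinel and liveDocs) follow a common pattern when scanning postings. The sketch below is a plain-Java stand-in, not the patch's actual code: a BitSet plays the role of liveDocs and a sentinel constant stands in for Lucene's DocIdSetIterator.NO_MORE_DOCS, just to illustrate the two checks:

```java
import java.util.BitSet;

public class LiveDocScanSketch {
    // Sentinel meaning "iteration exhausted", like Lucene's DocIdSetIterator.NO_MORE_DOCS.
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Return the first non-deleted doc id, or -1 if none.
    // liveDocs == null means the segment has no deletions.
    public static int firstLiveDoc(int[] postings, BitSet liveDocs) {
        for (int doc : postings) {
            if (doc == NO_MORE_DOCS) {
                break; // getting an enum back is not enough; the sentinel must be checked too
            }
            if (liveDocs == null || liveDocs.get(doc)) {
                return doc; // skip deleted (overwritten) docs
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        BitSet live = new BitSet();
        live.set(5); // only doc 5 is live; doc 2 was overwritten and deleted
        System.out.println(firstLiveDoc(new int[] {2, 5, NO_MORE_DOCS}, live)); // 5
        System.out.println(firstLiveDoc(new int[] {NO_MORE_DOCS}, live));       // -1
    }
}
```

In real Lucene code the liveDocs filter is passed when requesting the docsEnum, so deleted docs are skipped at the source rather than checked by the caller.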
[jira] [Commented] (SOLR-2950) QueryElevationComponent needlessly looks up document ids
[ https://issues.apache.org/jira/browse/SOLR-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174258#comment-13174258 ] Grant Ingersoll commented on SOLR-2950: --- +1, go ahead and commit. QueryElevationComponent needlessly looks up document ids Key: SOLR-2950 URL: https://issues.apache.org/jira/browse/SOLR-2950 Project: Solr Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 4.0 Attachments: SOLR-2950.patch, SOLR-2950.patch, SOLR-2950.patch, SOLR-2950.patch The QueryElevationComponent needlessly instantiates a FieldCache and does look ups in it for every document. If we flipped things around a bit and got Lucene internal doc ids on inform() we could then simply do a much smaller and faster lookup during the sort. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (SOLR-2977) QueryElevationComponent should support fake excludes
[ https://issues.apache.org/jira/browse/SOLR-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned SOLR-2977: - Assignee: Grant Ingersoll QueryElevationComponent should support fake excludes -- Key: SOLR-2977 URL: https://issues.apache.org/jira/browse/SOLR-2977 Project: Solr Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor It would be handy to be able to, in the QEC, simply mark documents as excluded instead of completely excluding them. This can be achieved using the EditorialMarker that was recently added. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-2984) Function query does not work when when the value of function parameter has null.
Function query does not work when when the value of function parameter has null. Key: SOLR-2984 URL: https://issues.apache.org/jira/browse/SOLR-2984 Project: Solr Issue Type: Bug Components: SearchComponents - other Affects Versions: 3.3 Reporter: Pradeep Priority: Minor To reproduce, sort parameter in query looks like sort=sum(product(Rating,0.01),product(recip(ms(NOW/HOUR,Date),3.16e-11,1,1),0.04)) desc and if Rating column in the database has null values, results are not sorted according to the output value of the function. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
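A workaround that often helps in this situation — an assumption about this setup, not a verified fix for the bug, and only if the def() function is available in your Solr version — is to give the nullable field an explicit default with def(), so every document contributes a well-defined value to the computed sort key:

{code}
sort=sum(product(def(Rating,0),0.01),product(recip(ms(NOW/HOUR,Date),3.16e-11,1,1),0.04)) desc
{code}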
[jira] [Updated] (SOLR-2984) Function query does not work when the value of function parameter has null.
[ https://issues.apache.org/jira/browse/SOLR-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep updated SOLR-2984: -- Summary: Function query does not work when the value of function parameter has null. (was: Function query does not work when when the value of function parameter has null.) Function query does not work when the value of function parameter has null. --- Key: SOLR-2984 URL: https://issues.apache.org/jira/browse/SOLR-2984 Project: Solr Issue Type: Bug Components: SearchComponents - other Affects Versions: 3.3 Reporter: Pradeep Priority: Minor To reproduce, sort parameter in query looks like sort=sum(product(Rating,0.01),product(recip(ms(NOW/HOUR,Date),3.16e-11,1,1),0.04)) desc and if Rating column in the database has null values, results are not sorted according to the output value of the function. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2242) Get distinct count of names for a facet field
[ https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erick Erickson updated SOLR-2242: - Attachment: SOLR-2242.patch First step in resurrecting this. This patch should apply cleanly to trunk. It incorporates the SOLR-2242.patch from 28-June and the NmFacetTermsFacetsTest from 9-July. It accounts for the fact that things seem to have been moved around a bit. Get distinct count of names for a facet field - Key: SOLR-2242 URL: https://issues.apache.org/jira/browse/SOLR-2242 Project: Solr Issue Type: New Feature Components: Response Writers Affects Versions: 4.0 Reporter: Bill Bell Assignee: Erick Erickson Priority: Minor Fix For: 4.0 Attachments: NumFacetTermsFacetsTest.java, SOLR-2242-notworkingtest.patch, SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.shard.patch, SOLR-2242.shard.patch, SOLR-2242.shard.withtests.patch, SOLR-2242.solr3.1-fix.patch, SOLR-2242.solr3.1.patch, SOLR.2242.solr3.1.patch, SOLR.2242.v2.patch When returning facet.field=name of field you will get a list of matches for distinct values. This is normal behavior. This patch tells you how many distinct values you have (# of rows). Use with limit=-1 and mincount=1. The feature is called namedistinct.
Here is an example:
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=2&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=0&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=1&facet.limit=-1&facet.field=price
This currently only works on facet.field. 
{code}
<lst name="facet_fields">
  <lst name="price">
    <int name="numFacetTerms">14</int>
    <int name="0.0">3</int><int name="11.5">1</int><int name="19.95">1</int><int name="74.99">1</int><int name="92.0">1</int><int name="179.99">1</int><int name="185.0">1</int><int name="279.95">1</int><int name="329.95">1</int><int name="350.0">1</int><int name="399.0">1</int><int name="479.95">1</int><int name="649.99">1</int><int name="2199.0">1</int>
  </lst>
</lst>
{code}
Several people use this to get the group.field count (the # of groups). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Issue Comment Edited] (SOLR-2242) Get distinct count of names for a facet field
[ https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174441#comment-13174441 ] Erick Erickson edited comment on SOLR-2242 at 12/21/11 9:51 PM: First step in resurrecting this. This patch should apply cleanly to trunk. It incorporates the SOLR-2242.patch from 28-June and the NumFacetTermsFacetsTest from 9-July. It accounts for the fact that things seem to have been moved around a bit. All I guarantee is that the code compiles and the NumFacetTermsFacetsTest runs from inside IntelliJ. was (Author: erickerickson): First step in resurrecting this. This patch should apply cleanly to trunk. It incorporates the SOLR-2242.patch from 28-June and the NumFacetTermsFacetsTest from 9-July. It accounts for the fact that things seem to have been moved around a bit. Get distinct count of names for a facet field - Key: SOLR-2242 URL: https://issues.apache.org/jira/browse/SOLR-2242 Project: Solr Issue Type: New Feature Components: Response Writers Affects Versions: 4.0 Reporter: Bill Bell Assignee: Erick Erickson Priority: Minor Fix For: 4.0 Attachments: NumFacetTermsFacetsTest.java, SOLR-2242-notworkingtest.patch, SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.shard.patch, SOLR-2242.shard.patch, SOLR-2242.shard.withtests.patch, SOLR-2242.solr3.1-fix.patch, SOLR-2242.solr3.1.patch, SOLR.2242.solr3.1.patch, SOLR.2242.v2.patch When returning facet.field=name of field you will get a list of matches for distinct values. This is normal behavior. This patch tells you how many distinct values you have (# of rows). Use with limit=-1 and mincount=1. The feature is called namedistinct. 
Here is an example:
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=2&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=0&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=1&facet.limit=-1&facet.field=price
This currently only works on facet.field.
{code}
<lst name="facet_fields">
  <lst name="price">
    <int name="numFacetTerms">14</int>
    <int name="0.0">3</int><int name="11.5">1</int><int name="19.95">1</int><int name="74.99">1</int><int name="92.0">1</int><int name="179.99">1</int><int name="185.0">1</int><int name="279.95">1</int><int name="329.95">1</int><int name="350.0">1</int><int name="399.0">1</int><int name="479.95">1</int><int name="649.99">1</int><int name="2199.0">1</int>
  </lst>
</lst>
{code}
Several people use this to get the group.field count (the # of groups). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Issue Comment Edited] (SOLR-2242) Get distinct count of names for a facet field
[ https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174441#comment-13174441 ] Erick Erickson edited comment on SOLR-2242 at 12/21/11 9:50 PM: First step in resurrecting this. This patch should apply cleanly to trunk. It incorporates the SOLR-2242.patch from 28-June and the NumFacetTermsFacetsTest from 9-July. It accounts for the fact that things seem to have been moved around a bit. was (Author: erickerickson): First step in resurrecting this. This patch should apply cleanly to trunk. It incorporates the SOLR-2242.patch from 28-June and the NmFacetTermsFacetsTest from 9-July. It accounts for the fact that things seem to have been moved around a bit. Get distinct count of names for a facet field - Key: SOLR-2242 URL: https://issues.apache.org/jira/browse/SOLR-2242 Project: Solr Issue Type: New Feature Components: Response Writers Affects Versions: 4.0 Reporter: Bill Bell Assignee: Erick Erickson Priority: Minor Fix For: 4.0 Attachments: NumFacetTermsFacetsTest.java, SOLR-2242-notworkingtest.patch, SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.shard.patch, SOLR-2242.shard.patch, SOLR-2242.shard.withtests.patch, SOLR-2242.solr3.1-fix.patch, SOLR-2242.solr3.1.patch, SOLR.2242.solr3.1.patch, SOLR.2242.v2.patch When returning facet.field=name of field you will get a list of matches for distinct values. This is normal behavior. This patch tells you how many distinct values you have (# of rows). Use with limit=-1 and mincount=1. The feature is called namedistinct. 
Here is an example:
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=2&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=0&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=1&facet.limit=-1&facet.field=price
This currently only works on facet.field.
{code}
<lst name="facet_fields">
  <lst name="price">
    <int name="numFacetTerms">14</int>
    <int name="0.0">3</int><int name="11.5">1</int><int name="19.95">1</int><int name="74.99">1</int><int name="92.0">1</int><int name="179.99">1</int><int name="185.0">1</int><int name="279.95">1</int><int name="329.95">1</int><int name="350.0">1</int><int name="399.0">1</int><int name="479.95">1</int><int name="649.99">1</int><int name="2199.0">1</int>
  </lst>
</lst>
{code}
Several people use this to get the group.field count (the # of groups). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2906) Implement LFU Cache
[ https://issues.apache.org/jira/browse/SOLR-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174475#comment-13174475 ] Shawn Heisey commented on SOLR-2906: I must be dense. I can figure out how to add the timeDecay option, but I can't figure out what section of code to enable/disable based on the value of timeDecay. I've gone as far as doing a diff on my Nov 24th patch and the Dec 20th patch from Erick. (doing diffs on diffs ... the world is going to explode!) The only differences I can see between the two is in whitespace/formatting. Implement LFU Cache --- Key: SOLR-2906 URL: https://issues.apache.org/jira/browse/SOLR-2906 Project: Solr Issue Type: Sub-task Components: search Affects Versions: 3.4 Reporter: Shawn Heisey Assignee: Erick Erickson Priority: Minor Attachments: ConcurrentLFUCache.java, LFUCache.java, SOLR-2906.patch, SOLR-2906.patch, SOLR-2906.patch, SOLR-2906.patch, SOLR-2906.patch, TestLFUCache.java Implement an LFU (Least Frequently Used) cache as the first step towards a full ARC cache -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2977) QueryElevationComponent should support fake excludes
[ https://issues.apache.org/jira/browse/SOLR-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated SOLR-2977: -- Attachment: SOLR-2977.patch first draft. QueryElevationComponent should support fake excludes -- Key: SOLR-2977 URL: https://issues.apache.org/jira/browse/SOLR-2977 Project: Solr Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Attachments: SOLR-2977.patch It would be handy to be able to, in the QEC, simply mark documents as excluded instead of completely excluding them. This can be achieved using the EditorialMarker that was recently added. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 1311 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/1311/

1 tests failed.

REGRESSION: org.apache.solr.search.TestRealTimeGet.testStressGetRealtime

Error Message: java.lang.AssertionError: Some threads threw uncaught exceptions!

Stack Trace:
java.lang.RuntimeException: java.lang.AssertionError: Some threads threw uncaught exceptions!
	at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:657)
	at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:86)
	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165)
	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57)
	at org.apache.lucene.util.LuceneTestCase.checkUncaughtExceptionsAfter(LuceneTestCase.java:685)
	at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:629)

Build Log (for compile errors): [...truncated 11794 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2906) Implement LFU Cache
[ https://issues.apache.org/jira/browse/SOLR-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erick Erickson updated SOLR-2906: - Attachment: SOLR-2906.patch Here's what I had in mind; at least I *think* this will do, but all I've done is ensure that the code compiles and the current LFU test suite runs. Look in the diff for timeDecay. This still needs some proof that the new parameter comes through from a schema file. Let me know if that presents a problem or if you can't get 'round to it; I might have some time over Christmas. I think maybe you were under the impression that this had already been done and were looking for it to be in the code already? Implement LFU Cache --- Key: SOLR-2906 URL: https://issues.apache.org/jira/browse/SOLR-2906 Project: Solr Issue Type: Sub-task Components: search Affects Versions: 3.4 Reporter: Shawn Heisey Assignee: Erick Erickson Priority: Minor Attachments: ConcurrentLFUCache.java, LFUCache.java, SOLR-2906.patch, SOLR-2906.patch, SOLR-2906.patch, SOLR-2906.patch, SOLR-2906.patch, SOLR-2906.patch, TestLFUCache.java Implement an LFU (Least Frequently Used) cache as the first step towards a full ARC cache -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
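Since the open question above is where timeDecay should take effect, here is a minimal, self-contained sketch of the general idea (hypothetical names, not the patch's actual code): when timeDecay is on, every surviving entry's hit count is halved during an eviction sweep, so entries that were hot long ago gradually lose their advantage over recently popular ones:

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class LfuDecaySketch {
    static final class Entry {
        final String key;
        long hits;
        Entry(String key) { this.key = key; }
    }

    private final Map<String, Entry> map = new HashMap<>();
    private final boolean timeDecay;

    LfuDecaySketch(boolean timeDecay) { this.timeDecay = timeDecay; }

    void touch(String key) {
        map.computeIfAbsent(key, Entry::new).hits++;
    }

    long hits(String key) { return map.get(key).hits; }

    // Evict the least-frequently-used entry; with timeDecay, halve all remaining counts.
    String evictOne() {
        Entry victim = Collections.min(map.values(), Comparator.comparingLong(e -> e.hits));
        map.remove(victim.key);
        if (timeDecay) {
            for (Entry e : map.values()) {
                e.hits >>>= 1; // decay: old popularity fades with each sweep
            }
        }
        return victim.key;
    }

    public static void main(String[] args) {
        LfuDecaySketch cache = new LfuDecaySketch(true);
        for (int i = 0; i < 8; i++) cache.touch("hot");
        for (int i = 0; i < 3; i++) cache.touch("warm");
        cache.touch("cold");
        System.out.println(cache.evictOne());   // cold
        System.out.println(cache.hits("hot"));  // 4 (halved from 8)
    }
}
```

The enable/disable point is therefore the eviction sweep: with timeDecay off, the decay loop is simply skipped and raw lifetime counts decide eviction.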
[jira] [Commented] (SOLR-2841) Scriptable UpdateRequestChain
[ https://issues.apache.org/jira/browse/SOLR-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174566#comment-13174566 ] Lance Norskog commented on SOLR-2841: - +1 Another use case for scripting at the top level is multi-query queries: where the app creates the second based on the first. Would your proposal handle this problem? Many use cases for grouping/collapsing can be implemented with 2 queries. Perhaps the guts of collapsing could be simplified if the more outré use cases could be pushed out into multiple queries.

Scriptable UpdateRequestChain - Key: SOLR-2841 URL: https://issues.apache.org/jira/browse/SOLR-2841 Project: Solr Issue Type: New Feature Components: update Reporter: Jan Høydahl UpdateProcessorChains must currently be defined with XML in solrconfig.xml. We should explore a scriptable chain implementation with a DSL that allows for full flexibility. The first step would be to make UpdateChain implementations pluggable in solrconfig.xml, for backward compat support. Benefits and possibilities with a Scriptable UpdateChain:
* A compact DSL for defining Processors and Chains (Workflows would be a better, less limited term here)
* Keeping update processor config separate from solrconfig.xml gives better separations of roles
* Use this as an opportunity to natively support scripting language Processors (ideas from SOLR-1725)
This issue is spun off from SOLR-2823. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: highlight bug
(11/12/21 4:50), Celso Oliveira wrote: Hey guys, I'm having a little problem with solr 3.4.0 when I turn on the highlight. It looks like this: https://issues.apache.org/jira/browse/SOLR-925 (fixed in the 1.4 version). But now, on the 3.4.0 version, I still get this error.

Your problem is not the same as SOLR-925, because you use FVH. Please open a ticket with the following info:
- schema.xml (field type and field of hl.fl)
- request url
- document data
thanks! koji -- http://www.rondhuit.com/en/ - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2346) Non UTF-8 Text files having other than English texts (Japanese/Hebrew) are not getting indexed correctly.
[ https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174637#comment-13174637 ] Shinichiro Abe commented on SOLR-2346: -- I've faced the same problem. Tika parsed my Shift_JIS file as windows-1252, so I could not see the desired results. I can index the file correctly by applying Koji's patch. But this patch is effective for remote streaming, not for POST, so I changed a part of the code as below:
{noformat}
//String charset = ContentStreamBase.getCharsetFromContentType(stream.getContentType());
String contentType = req.getParams().get(CommonParams.STREAM_CONTENTTYPE, null);
String charset = ContentStreamBase.getCharsetFromContentType(contentType);
{noformat}
Non UTF-8 Text files having other than English texts (Japanese/Hebrew) are not getting indexed correctly. --- Key: SOLR-2346 URL: https://issues.apache.org/jira/browse/SOLR-2346 Project: Solr Issue Type: Bug Components: contrib - Solr Cell (Tika extraction) Affects Versions: 1.4.1, 3.1, 4.0 Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows XP SP1, Machine was booted in Japanese Locale. Reporter: Prasad Deshpande Assignee: Koji Sekiguchi Priority: Critical Fix For: 3.6, 4.0 Attachments: NormalSave.msg, SOLR-2346.patch, UnicodeSave.msg, sample_jap_UTF-8.txt, sample_jap_non_UTF-8.txt I am able to successfully index/search non-English files (like Hebrew, Japanese) that were encoded in UTF-8. However, when I tried to index data which was encoded in a local encoding like Big5 for Japanese, I could not see the desired results. The contents after indexing looked garbled for the Big5-encoded document when I searched for all indexed documents. 
When I index the attached non-UTF-8 file it indexes in the following way:
{code}
<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <arr name="attr_content"><str>�� ��</str></arr>
      <arr name="attr_content_encoding"><str>Big5</str></arr>
      <arr name="attr_content_language"><str>zh</str></arr>
      <arr name="attr_language"><str>zh</str></arr>
      <arr name="attr_stream_size"><str>17</str></arr>
      <arr name="content_type"><str>text/plain</str></arr>
      <str name="id">doc2</str>
    </doc>
  </result>
</response>
{code}
Here you said it indexes the file in UTF-8; however, it seems that the non-UTF-8 file gets indexed in Big5 encoding. Here I tried fetching the indexed data stream in Big5 and converting it to UTF-8:
{code}
String id = (String) resultDocument.getFirstValue("attr_content");
byte[] bytearray = id.getBytes("Big5");
String utf8String = new String(bytearray, "UTF-8");
{code}
It does not give the expected results. When I index the UTF-8 file it indexes like the following:
{code}
<doc>
  <arr name="attr_content"><str>マイ ネットワーク</str></arr>
  <arr name="attr_content_encoding"><str>UTF-8</str></arr>
  <arr name="attr_stream_content_type"><str>text/plain</str></arr>
  <arr name="attr_stream_name"><str>sample_jap_unicode.txt</str></arr>
  <arr name="attr_stream_size"><str>28</str></arr>
  <arr name="attr_stream_source_info"><str>myfile</str></arr>
  <arr name="content_type"><str>text/plain</str></arr>
  <str name="id">doc2</str>
</doc>
{code}
So, I can index and search UTF-8 data. For more reference, below is the discussion with Yonik. Please find attached the TXT file which I was using to index and search.
{code}
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true&charset=utf-8" -F myfile=@sample_jap_non_UTF-8
{code}
One problem is that you are giving big5 encoded text to Solr and saying that it's UTF8. 
Here's one way to actually tell Solr what the encoding of the text you are sending is:
{code}
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true" --data-binary @sample_jap_non_UTF-8.txt -H 'Content-type:text/plain; charset=big5'
{code}
Now the problem appears that for some reason, this doesn't work... Could you open a JIRA issue and attach your two test files? -Yonik http://lucidimagination.com -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
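The encoding confusion in this thread is easy to reproduce in plain Java: the same bytes decoded with the charset they were written in round-trip fine, while a wrong fallback (what a parser guesses when no charset reaches it) garbles them. This sketch uses Shift_JIS, as in Shinichiro's report; the same holds for Big5:

```java
import java.nio.charset.Charset;

public class CharsetMismatchDemo {
    public static void main(String[] args) {
        String original = "マイ ネットワーク"; // the sample text from the issue
        byte[] bytes = original.getBytes(Charset.forName("Shift_JIS"));

        // Decoding with the charset the bytes were actually written in works:
        String right = new String(bytes, Charset.forName("Shift_JIS"));
        // Decoding with an unrelated single-byte charset produces mojibake:
        String wrong = new String(bytes, Charset.forName("windows-1252"));

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false
    }
}
```

This is why the charset attribute from the Content-Type header must be passed through to the extracting parser rather than letting it fall back to a platform or detector default.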