Re: VOTE: Lucene/Solr 3.6.1
Ok thanks Mark.

Sent from my mobile device
720-256-8076

On Jul 18, 2012, at 6:41 AM, Mark Miller markrmil...@gmail.com wrote:

It's a trunk issue, not 3.x

Sent from my iPhone

On Jul 18, 2012, at 3:23 AM, Uwe Schindler u...@thetaphi.de wrote:

Hi William,

The mentioned issue has no patch, and the committed work seems to apply only to Lucene trunk (it uses soft commit in tests). I asked on Monday whether anybody had patches to back-port, and this one was not mentioned. You could have reopened it long ago and set the fix version to 3.6.1.

The vote for the new release has not yet officially passed, but so far we have received no negative responses from PMC members. If Yonik, who opened the issue, confirms that this is an issue for 3.6 and wants to fix it there, I may be able to respin the release, but in general this would better wait for 3.6.2.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: William Bell [mailto:billnb...@gmail.com]
Sent: Wednesday, July 18, 2012 4:56 AM
To: dev@lucene.apache.org
Subject: Re: VOTE: Lucene/Solr 3.6.1

Can we try to get this in? Seems like a major issue to us.
https://issues.apache.org/jira/browse/SOLR-3392

On Tue, Jul 17, 2012 at 9:01 AM, Uwe Schindler u...@thetaphi.de wrote:

Please vote to release these artifacts for Apache Lucene and Solr 3.6.1:
http://s.apache.org/lucene361

I tested with dev-tools/scripts/smokeTestRelease.py, ran rat-sources on both source releases, tested the Solr example, and reviewed the packaging contents. There was only a minor issue in the smoke tester: it did not test Solr with Java 5, but I did that manually, so the Solr example and tests work with Java 5 (the release itself was built with Java 5).

Here's my +1.

--
Bill Bell
billnb...@gmail.com
cell 720-256-8076

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-3653) Support Smart Simplified Chinese in Solr - include clean-up bigramming filter
Lance Norskog created SOLR-3653:
-----------------------------------

    Summary: Support Smart Simplified Chinese in Solr - include clean-up bigramming filter
        Key: SOLR-3653
        URL: https://issues.apache.org/jira/browse/SOLR-3653
    Project: Solr
 Issue Type: New Feature
 Components: Schema and Analysis
   Reporter: Lance Norskog

The Smart Simplified Chinese toolkit in lucene/analysis/smartcn has no Solr factories. Also, since it is a statistical algorithm, it is not perfect. This patch supplies factories and a schema.xml type for the existing Lucene Smart Chinese implementation, and includes a fixup class to handle the occasional mistake made by the Smart Chinese implementation.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (SOLR-3653) Support Smart Simplified Chinese in Solr - include clean-up bigramming filter
[ https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lance Norskog updated SOLR-3653:
--------------------------------

    Attachment: SmartChineseType.pdf
[jira] [Updated] (SOLR-3653) Support Smart Simplified Chinese in Solr - include clean-up bigramming filter
[ https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lance Norskog updated SOLR-3653:
--------------------------------

    Attachment: SOLR-3653.patch
[jira] [Commented] (SOLR-3653) Support Smart Simplified Chinese in Solr - include clean-up bigramming filter
[ https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418997#comment-13418997 ]

Lance Norskog commented on SOLR-3653:
-------------------------------------

The SmartChineseWordTokenFilter is a statistical algorithm (a Hidden Markov Model, to be exact) which was trained on a corpus of training text. Its purpose is to split text into words: singles, bigrams, and occasionally trigrams of Simplified Chinese ideograms (letters). It does a very good job, but since it is statistically based it is not perfect. When it fails, it emits "words" of 4 or more ideograms. These are really phrases, and they contain real words which should be searchable.

The attached PDF of the Analysis page shows the problem. Chinese legal text proved a pathological case and created a 7-ideogram word. To make parts of this text searchable, the 7-letter phrase has to be broken into n-grams: unigrams give more recall, while bigrams give more precision. This patch includes a new SmartChineseBigramFilter that takes any words not split by the WordTokenFilter and creates bigrams from them. The bigrams span only the unsplit phrase; they do not overlap between two adjoining unsplit phrases. The attached PDF also shows this effect, between the first and second unsplit phrases.

I am not an expert on the Chinese language or the HMM technology used in the Smart Chinese toolkit. I created the bigram filter after difficulties attempting to supply a high-quality search experience for Chinese legal documents. This is a straw-man solution to the problem. If you know better, please say so and we will iterate.

The patch includes a 'text_zh' field type which includes the bigram filter. The bigram filter is essential if 'text_zh' is to be the preferred recommendation.
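The bigramming fallback described in the comment can be sketched in plain Java. This is a hypothetical illustration of the idea only, not the SmartChineseBigramFilter from the patch (which is a Lucene TokenFilter): any "word" of 4 or more code points is treated as an unsplit phrase and replaced by its overlapping bigrams, and the bigrams never cross the boundary between two adjacent tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the clean-up bigramming idea (hypothetical code, not the patch):
// short, well-segmented words pass through; longer unsplit phrases are
// replaced by their bigrams, which stay inside the phrase boundary.
public class PhraseBigrams {
    static List<String> process(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String tok : tokens) {
            if (tok.codePointCount(0, tok.length()) < 4) {
                out.add(tok); // 1-3 ideograms: the segmenter did its job
            } else {
                // unsplit phrase: emit overlapping bigrams within it only
                int i = 0;
                while (i < tok.length()) {
                    int next = tok.offsetByCodePoints(i, 1);
                    if (next >= tok.length()) break;
                    out.add(tok.substring(i, tok.offsetByCodePoints(next, 1)));
                    i = next;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Latin letters stand in for ideograms: a 7-"letter" unsplit phrase
        // followed by a correctly segmented 2-letter word.
        System.out.println(process(List.of("abcdefg", "hi")));
        // -> [ab, bc, cd, de, ef, fg, hi]
    }
}
```

Note that the bigrams of "abcdefg" stop at "fg"; they do not continue into the adjacent token "hi", matching the no-overlap-between-phrases behavior the comment describes.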
[JENKINS] Lucene-Solr-4.x-Windows-Java6-64 - Build # 384 - Failure!
Build: http://jenkins.sd-datasolutions.de/job/Lucene-Solr-4.x-Windows-Java6-64/384/

1 tests failed.
FAILED: junit.framework.TestSuite.org.apache.solr.cloud.CloudStateUpdateTest

Error Message:
ERROR: SolrIndexSearcher opens=5 closes=4

Stack Trace:
java.lang.AssertionError: ERROR: SolrIndexSearcher opens=5 closes=4
	at __randomizedtesting.SeedInfo.seed([FD5FC09E1D433B61]:0)
	at org.junit.Assert.fail(Assert.java:93)
	at org.apache.solr.SolrTestCaseJ4.endTrackingSearchers(SolrTestCaseJ4.java:216)
	at org.apache.solr.SolrTestCaseJ4.afterClass(SolrTestCaseJ4.java:82)
	at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1995)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.access$1100(RandomizedRunner.java:132)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:754)
	at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:53)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleReportUncaughtExceptions$1.evaluate(TestRuleReportUncaughtExceptions.java:68)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
	at org.apache.lucene.util.TestRuleIcuHack$1.evaluate(TestRuleIcuHack.java:51)
	at com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:55)
	at org.apache.lucene.util.TestRuleNoInstanceHooksOverrides$1.evaluate(TestRuleNoInstanceHooksOverrides.java:53)
	at org.apache.lucene.util.TestRuleNoStaticHooksShadowing$1.evaluate(TestRuleNoStaticHooksShadowing.java:52)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:36)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:70)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:55)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSuite(RandomizedRunner.java:605)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.access$400(RandomizedRunner.java:132)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$2.run(RandomizedRunner.java:551)

Build Log:
[...truncated 15236 lines...]
[junit4:junit4] Suite: org.apache.solr.cloud.CloudStateUpdateTest
[junit4:junit4] (@AfterClass output)
[junit4:junit4] 2 22468 T1039 oas.SolrTestCaseJ4.deleteCore ###deleteCore
[junit4:junit4] 2 147384 T1039 oas.SolrTestCaseJ4.endTrackingSearchers SEVERE ERROR: SolrIndexSearcher opens=5 closes=4
[junit4:junit4] 2 NOTE: test params are: codec=Appending, sim=RandomSimilarityProvider(queryNorm=true,coord=false): {}, locale=es_PA, timezone=America/Cordoba
[junit4:junit4] 2 NOTE: Windows 7 6.1 amd64/Sun Microsystems Inc. 1.6.0_33 (64-bit)/cpus=2,threads=3,free=67579512,total=195362816
[junit4:junit4] 2 NOTE: All tests run in this JVM: [DocumentBuilderTest, TestCollationKeyRangeQueries, DistributedTermsComponentTest, TestQueryUtils, TestCharFilters, TestElisionFilterFactory, UUIDFieldTest, TestJapaneseBaseFormFilterFactory, SolrRequestParserTest, TestUpdate, FastVectorHighlighterTest, TestPatternReplaceFilterFactory, TestRangeQuery, DocumentAnalysisRequestHandlerTest, TestSwedishLightStemFilterFactory, CircularListTest, TestPorterStemFilterFactory, RAMDirectoryFactoryTest, DateMathParserTest, DistributedSpellCheckComponentTest, TestQuerySenderNoQuery, UpdateRequestProcessorFactoryTest, ShowFileRequestHandlerTest, TestHungarianLightStemFilterFactory, TestIrishLowerCaseFilterFactory, DebugComponentTest, BasicDistributedZkTest, TestCoreContainer, ZkSolrClientTest, SnowballPorterFilterFactoryTest, TestReversedWildcardFilterFactory, SpellCheckCollatorTest, TestOmitPositions, SOLR749Test, TestSlowSynonymFilter, TestSort, UpdateParamsTest, TestQuerySenderListener, LegacyHTMLStripCharFilterTest, CommonGramsQueryFilterFactoryTest, CloudStateTest, TestLMDirichletSimilarityFactory, TestLatvianStemFilterFactory, BadComponentTest, SpellPossibilityIteratorTest, TestPortugueseMinimalStemFilterFactory, SortByFunctionTest, SolrCoreCheckLockOnStartupTest, TestRemoveDuplicatesTokenFilterFactory, NotRequiredUniqueKeyTest, ScriptEngineTest, TestPHPSerializedResponseWriter, FullSolrCloudDistribCmdsTest, TestPatternReplaceCharFilterFactory, TestNorwegianMinimalStemFilterFactory, TestKeepFilterFactory, SpatialFilterTest,
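The "opens=5 closes=4" failure above comes from bookkeeping that counts searcher opens against closes and fails the suite when they disagree. A minimal sketch of that kind of leak tracking (hypothetical code, not Solr's actual SolrTestCaseJ4.endTrackingSearchers):

```java
// Sketch of open/close leak tracking: every searcher open bumps one counter,
// every close bumps the other, and teardown reports a failure when they
// disagree, i.e. when a searcher was opened but never closed. (The real test
// framework does this across threads; a plain instance keeps the sketch simple.)
public class SearcherTracker {
    private int opens, closes;

    void onOpen()  { opens++; }
    void onClose() { closes++; }

    // returns null when balanced, otherwise the failure message
    String endTracking() {
        return opens == closes
                ? null
                : "ERROR: SolrIndexSearcher opens=" + opens + " closes=" + closes;
    }

    public static void main(String[] args) {
        SearcherTracker t = new SearcherTracker();
        for (int i = 0; i < 5; i++) t.onOpen();
        for (int i = 0; i < 4; i++) t.onClose(); // one searcher leaked
        System.out.println(t.endTracking());
        // -> ERROR: SolrIndexSearcher opens=5 closes=4
    }
}
```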
[jira] [Updated] (SOLR-3618) Enable replication of master using proxy settings
[ https://issues.apache.org/jira/browse/SOLR-3618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gautier Koscielny updated SOLR-3618:
------------------------------------

    Attachment: SnapPuller.java.patch

I've modified the createHttpClient method to take proxy settings into account. The HttpClient instance is created as before, and proxy settings are then added to the host configuration if required.

Enable replication of master using proxy settings
-------------------------------------------------

                Key: SOLR-3618
                URL: https://issues.apache.org/jira/browse/SOLR-3618
            Project: Solr
         Issue Type: Improvement
         Components: replication (java)
   Affects Versions: 3.6.1
           Reporter: Gautier Koscielny
             Labels: patch
            Fix For: 3.6.1
        Attachments: SnapPuller.java.patch
  Original Estimate: 4h
 Remaining Estimate: 4h

Check whether system properties http.proxyHost and http.proxyPort are set to initialize the httpClient instance properly in the SnapPuller class.
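The check the issue describes can be sketched with standard-library types only. This is not the patch's actual code (SnapPuller configures a commons-httpclient HostConfiguration); it is a hypothetical illustration of the same logic: read the standard http.proxyHost / http.proxyPort system properties and build a proxy when they are set, otherwise connect directly.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

// Sketch (assumptions: class and method names are made up for illustration):
// if http.proxyHost is set, route replication HTTP traffic through that proxy;
// http.proxyPort conventionally defaults to 80 when unset.
public class ProxyFromSystemProps {
    static Proxy resolve() {
        String host = System.getProperty("http.proxyHost");
        if (host == null || host.isEmpty()) {
            return Proxy.NO_PROXY; // no proxy configured: direct connection
        }
        int port = Integer.parseInt(System.getProperty("http.proxyPort", "80"));
        // createUnresolved avoids a DNS lookup at configuration time
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    public static void main(String[] args) {
        System.setProperty("http.proxyHost", "proxy.example.com"); // hypothetical host
        System.setProperty("http.proxyPort", "3128");
        System.out.println(resolve().type()); // HTTP
    }
}
```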
[jira] [Commented] (SOLR-3167) Allow running embedded zookeeper 1 for 1 dynamically with solr nodes
[ https://issues.apache.org/jira/browse/SOLR-3167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419015#comment-13419015 ]

Jan Høydahl commented on SOLR-3167:
-----------------------------------

I was thinking auto-everything by default :) like ElasticSearch:

# Start Solr on a node without any options other than telling it to start in cloud mode
## If -DzkHost is not specified, it will try auto-discovery (through some zero-conf protocol) and join the existing ZK
## If no existing ZK is found, spin up a local one
# Start Solr on another node; it will discover the existing one(s) without any host:port at startup
## If there are too few ZK servers, it will start another one and refresh the ZK list on all other nodes
## If there are enough ZK servers already, it will simply join

It should also be possible to auto-start ZK on another node if one master has failed.

Allow running embedded zookeeper 1 for 1 dynamically with solr nodes
--------------------------------------------------------------------

                Key: SOLR-3167
                URL: https://issues.apache.org/jira/browse/SOLR-3167
            Project: Solr
         Issue Type: Improvement
           Reporter: Mark Miller
           Assignee: Mark Miller

Right now you have to decide which nodes run ZooKeeper up front - each node must know the list of all the servers in the ensemble. Growing or shrinking the list of nodes requires a rolling restart. https://issues.apache.org/jira/browse/ZOOKEEPER-1355 (Add zk.updateServerList(newServerList)) might be able to help us here. Perhaps the overseer could make a call to each replica when the list changes and use the update-server-list call.
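The startup decision Jan proposes can be sketched as a small decision function. All names here are hypothetical (nothing below is SolrCloud API); it only encodes the branching from the comment: an explicit -DzkHost wins, the first node spins up a local embedded ZooKeeper, and later nodes either grow the ensemble or simply join it.

```java
// Sketch of the proposed "auto-everything" bootstrap flow (hypothetical names):
// given whether zkHost was passed, how many ZK servers were discovered, and the
// desired ensemble size, decide what this starting node should do.
public class ZkBootstrap {
    enum Action { USE_EXPLICIT_ZKHOST, START_EMBEDDED, START_EMBEDDED_AND_JOIN, JOIN_DISCOVERED }

    static Action decide(String zkHostProp, int discoveredEnsembleSize, int desiredEnsembleSize) {
        if (zkHostProp != null) return Action.USE_EXPLICIT_ZKHOST;     // -DzkHost specified
        if (discoveredEnsembleSize == 0) return Action.START_EMBEDDED; // first node: local ZK
        // ensemble exists: start another ZK server if it is still too small
        return discoveredEnsembleSize < desiredEnsembleSize
                ? Action.START_EMBEDDED_AND_JOIN
                : Action.JOIN_DISCOVERED;
    }

    public static void main(String[] args) {
        System.out.println(decide(null, 0, 3)); // START_EMBEDDED
        System.out.println(decide(null, 1, 3)); // START_EMBEDDED_AND_JOIN
        System.out.println(decide(null, 3, 3)); // JOIN_DISCOVERED
    }
}
```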
[jira] [Commented] (SOLR-3167) Allow running embedded zookeeper 1 for 1 dynamically with solr nodes
[ https://issues.apache.org/jira/browse/SOLR-3167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419016#comment-13419016 ]

Raju commented on SOLR-3167:
----------------------------

hi
[jira] [Assigned] (LUCENE-4224) Simplify MultiValuedCase in TermsIncludingScoreQuery
[ https://issues.apache.org/jira/browse/LUCENE-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen reassigned LUCENE-4224:
---------------------------------------------

    Assignee: Martijn van Groningen

Simplify MultiValuedCase in TermsIncludingScoreQuery
----------------------------------------------------

                Key: LUCENE-4224
                URL: https://issues.apache.org/jira/browse/LUCENE-4224
            Project: Lucene - Java
         Issue Type: Task
           Reporter: Robert Muir
           Assignee: Martijn van Groningen
        Attachments: LUCENE-4224.patch

While looking at LUCENE-4214, I was trying to wrap my head around what this is doing... I think the code specialization in the multivalued scorer doesn't buy us any additional speed? At least according to my benchmarks?
[jira] [Commented] (SOLR-1781) Replication index directories not always cleaned up
[ https://issues.apache.org/jira/browse/SOLR-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419032#comment-13419032 ]

Markus Jelsma commented on SOLR-1781:
-------------------------------------

Hi - is the core reloading still part of this? I get a lot of firstSearcher events on a test node now, and it won't come online. Going back to a July 18th build (before this patch) works fine. Other nodes won't come online with a build from the 19th (after this patch).

Replication index directories not always cleaned up
---------------------------------------------------

                Key: SOLR-1781
                URL: https://issues.apache.org/jira/browse/SOLR-1781
            Project: Solr
         Issue Type: Bug
         Components: replication (java), SolrCloud
   Affects Versions: 1.4
        Environment: Windows Server 2003 R2, Java 6b18
           Reporter: Terje Sten Bjerkseth
           Assignee: Mark Miller
            Fix For: 4.0, 5.0
        Attachments: 0001-Replication-does-not-always-clean-up-old-directories.patch, SOLR-1781.patch, SOLR-1781.patch

We had the same problem as someone described in http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201001.mbox/%3c222a518d-ddf5-4fc8-a02a-74d4f232b...@snooth.com%3e. A partial copy of that message:

We're using the new replication and it's working pretty well. There's one detail I'd like to get some more information about. As the replication works, it creates versions of the index in the data directory. Originally we had index/, but now there are dated versions such as index.20100127044500/, which are the replicated versions. Each copy is sized in the vicinity of 65G. With our current hard drive it's fine to have two around, but 3 gets a little dicey. Sometimes we're finding that the replication doesn't always clean up after itself. I would like to understand this better, or to not have this happen. It could be a configuration issue.
[jira] [Updated] (LUCENE-4109) BooleanQueries are not parsed correctly with the flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-4109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karsten R. updated LUCENE-4109:
-------------------------------

    Attachment: LUCENE-4109.patch

Patch for lucene/contrib against http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6

The patch adds the processor BooleanQuery2ModifierNodeProcessor. It also changes ParametricRangeQueryNodeProcessor as a hotfix for LUCENE-3338 (this change is not for 4.X because LUCENE-3338 is already fixed in 4.X). The patch passes all tests from QueryParserTestBase except {{{assertQueryEquals([\\* TO \*\],null,[\\* TO \\*]);}}} and the LUCENE-2566-related tests.

A patch for trunk will be coming soon.

BooleanQueries are not parsed correctly with the flexible query parser
----------------------------------------------------------------------

                Key: LUCENE-4109
                URL: https://issues.apache.org/jira/browse/LUCENE-4109
            Project: Lucene - Java
         Issue Type: Bug
         Components: modules/queryparser
   Affects Versions: 3.5, 3.6
           Reporter: Daniel Truemper
            Fix For: 4.0
        Attachments: LUCENE-4109.patch, test-patch.txt

Hi, I just found another bug in the flexible query parser (together with Robert Muir, yay!). The following query string works in the standard query parser:
{noformat}
(field:[1 TO *] AND field:[* TO 2]) AND field2:z
{noformat}
yields
{noformat}
+(+field:[1 TO *] +field:[* TO 2]) +field2:z
{noformat}
The flexible query parser, though, yields:
{noformat}
+(field:[1 TO *] field:[* TO 2]) +field2:z
{noformat}
A test patch is attached (from Robert, actually). I don't know if it affects versions earlier than 3.5.
[jira] [Updated] (LUCENE-3151) Make all of Analysis completely independent from Lucene Core
[ https://issues.apache.org/jira/browse/LUCENE-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated LUCENE-3151:
------------------------------------

    Attachment: LUCENE-3151.patch

Here's a first draft at this. The packaging looks more or less right, but I haven't fully tested it yet. The main downsides to this approach are:

# Minor loss of Javadoc due to references to things like IndexWriter, DoubleField, etc. I kept the references but removed the @link, which allowed me to drop the import statement.
# We need to somehow document that this jar is for standalone use only. It's probably a minor issue, but going forward people could get into classloader hell with this if they are mixing versions. Of course, that's always the case in Java, so caveat emptor.

Make all of Analysis completely independent from Lucene Core
------------------------------------------------------------

                Key: LUCENE-3151
                URL: https://issues.apache.org/jira/browse/LUCENE-3151
            Project: Lucene - Java
         Issue Type: Improvement
   Affects Versions: 4.0-ALPHA
           Reporter: Grant Ingersoll
            Fix For: 4.1
        Attachments: LUCENE-3151.patch, LUCENE-3151.patch

Lucene's analysis package, including the definitions of Attribute, TokenStream, etc., is quite useful outside of Lucene (for instance, Mahout uses it) for text processing. I'd like to move the definitions, or at least their packaging, to a separate JAR file so that one can consume them without needing Lucene core. My draft idea is to have a definition area that Lucene core depends on, with the rest of the analysis package then depending on that definition area. (I'm open to other ideas as well.)
[jira] [Commented] (LUCENE-3151) Make all of Analysis completely independent from Lucene Core
[ https://issues.apache.org/jira/browse/LUCENE-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419045#comment-13419045 ]

Grant Ingersoll commented on LUCENE-3151:
-----------------------------------------

I should add: to run this, for now, do {code}ant jar-analyzer-definition{code}. Still need to make sure it fully hooks into the rest of the build correctly, too.
Re: top level ant test shouldn't validate?
+1 on the original question... IntelliJ doesn't seem to have the problem; I run ant clean from the top level all the time, and my projects that depend on it seem to work fine. I vaguely remember in Eclipse having to do something like a project refresh to get things back in sync, but that may be unrelated.

On Thu, Jul 19, 2012 at 8:56 PM, Mark Miller markrmil...@gmail.com wrote:

Top-level ant clean breaks my IDE too! I don't know the fine points of this conversation, but it's super painful and I never call top-level ant clean anymore. I kept meaning to look into why it was killing me but never got to it.

Sent from my iPhone

On Jul 19, 2012, at 12:46 PM, Robert Muir rcm...@gmail.com wrote:

+1, we have caged the rat, we should be able to have a simple precommit check. Also, top-level 'ant clean' shouldn't call clean-jars. This *totally messes up* my IDE just because I like to run tests from the command line.

On Thu, Jul 19, 2012 at 12:40 PM, Steven A Rowe sar...@syr.edu wrote:

On 7/19/2012 at 12:35 PM, Michael McCandless wrote:

Any objections to fixing top-level ant test to simply run tests...? Maybe we can add a precommit target to run tests, validate, javadocs-lint, ...

+1
Steve

--
lucidimagination.com
[jira] [Created] (SOLR-3654) Add some tests using Tomcat as servlet container
Jan Høydahl created SOLR-3654:
---------------------------------

    Summary: Add some tests using Tomcat as servlet container
        Key: SOLR-3654
        URL: https://issues.apache.org/jira/browse/SOLR-3654
    Project: Solr
 Issue Type: Task
 Components: Build
Environment: Tomcat
   Reporter: Jan Høydahl
    Fix For: 4.0

All tests use Jetty; we should add some tests for at least one other servlet container (Tomcat). Ref discussion at http://search-lucene.com/m/6mo9Y1WZaWR1
[jira] [Updated] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM
[ https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-4227: --- Attachment: LUCENE-4227.patch New patch, fixing previous nocommits / downgrading to TODOs. I also removed the specialized scorers since they seem not to help much. All tests pass, but I still need to fix all tests that now avoid MemoryPF to also avoid DirectPF. Otherwise I think it's ready... DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM - Key: LUCENE-4227 URL: https://issues.apache.org/jira/browse/LUCENE-4227 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-4227.patch, LUCENE-4227.patch This postings format just wraps Lucene40 (on disk) but then at search time it loads (up front) all terms postings into RAM. You'd use this if you have insane amounts of RAM and want the fastest possible search performance. The postings are not compressed: docIds, positions are stored as straight int[]s. The terms are stored as a skip list (array of byte[]), but I packed all terms together into a single long byte[]: I had started as actual separate byte[] per term but the added pointer deref and loss of locality was a lot (~2X) slower for terms-dict intensive queries like FuzzyQuery. Low frequency postings (docFreq = 32 by default) store all docs, pos and offsets into a single int[]. High frequency postings store docs as int[], freqs as int[], and positions as int[][] parallel arrays. For skipping I just do a growing binary search. I also made specialized DirectTermScorer and DirectExactPhraseScorer for the high freq case that just pull the int[] and iterate themselves. All tests pass. -- This message is automatically generated by JIRA. 
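The "growing binary search" used for skipping in the issue description above can be sketched as a standalone method (hypothetical code, not the actual DirectPostingsFormat implementation): grow the search window exponentially from the current position, then binary search inside it, so the cost depends on the distance actually skipped rather than the array length.

```java
// Hypothetical sketch of a "growing" (galloping) binary search over a
// sorted int[] docID array, as described above. Not Lucene code.
public class GrowingBinarySearch {

    /** Returns the index of the first docID >= target at or after
     *  {@code from}, or {@code docs.length} if every docID is smaller. */
    public static int advance(int[] docs, int from, int target) {
        int lo = from;
        int step = 1;
        int hi = from + step;
        // Grow the window until it brackets the target or hits the end.
        while (hi < docs.length && docs[hi] < target) {
            lo = hi;
            step <<= 1;
            hi = from + step;
        }
        if (hi >= docs.length) {
            hi = docs.length - 1;
        }
        if (docs[hi] < target) {
            return docs.length; // postings exhausted
        }
        // Plain binary search within [lo, hi].
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}
```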
[jira] [Commented] (SOLR-3654) Add some tests using Tomcat as servlet container
[ https://issues.apache.org/jira/browse/SOLR-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419068#comment-13419068 ] Jan Høydahl commented on SOLR-3654: --- Have not done a lot of investigation but we could probably build this test using Cargo: http://cargo.codehaus.org/Ant+support Having got Tomcat test support, it should be trivial to add other supported containers as well [Geronimo, Glassfish, JBoss, Resin...]. Also, if all Jetty tests now use {{JettySolrRunner}}, we have a single point of entry to plug in container randomization further down the road. This could be controlled by options so that Jetty is default but nightly builds randomize container per run.
Re: [Discuss] Should Solr be an AppServer agnostic WAR or require Jetty?
I've created SOLR-3654 as a placeholder for adding tests using Tomcat (and possibly others). -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 16 July 2012, at 23:52, Chris Hostetter wrote: : Specifically: it would be a terrible idea to try and rush a change like this : in before Solr 4.0-FINAL ... : : That's just a silly premise - no one in this conversation even remotely : suggested we stop using webapps or wars for Solr 4.0-FINAL. Switching to : non webapp tech is 'probably' a bit of work. Right ... I didn't get the impression anyone who had spoken up so far was suggesting a change like this for Solr 4.0-FINAL. I just wanted to state that while i have very little opinion about *if* we should make a change like this, i have strong opinions about *when* we should try to make a change like this, if the discussion does go in that direction. -Hoss
[jira] [Commented] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM
[ https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419071#comment-13419071 ] Robert Muir commented on LUCENE-4227: - Would it really be that much slower if it was slightly more reasonable, e.g. storing freqs in packed ints (with huper-duper fast options) instead of wasting so much on them?
[jira] [Updated] (SOLR-3636) edismax, synonyms and mm=100%
[ https://issues.apache.org/jira/browse/SOLR-3636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Høydahl updated SOLR-3636: -- Component/s: query parsers Fix Version/s: 5.0 4.0 edismax, synonyms and mm=100% - Key: SOLR-3636 URL: https://issues.apache.org/jira/browse/SOLR-3636 Project: Solr Issue Type: Bug Components: query parsers Reporter: Lance Norskog Priority: Minor Fix For: 4.0, 5.0 There is a problem with query-side synonyms, edismax and must-match=100%. edismax interprets must-match=100% as the number of terms found by edismax in the original query. These terms go through the query analyzer, and the synonym filter creates more terms, *but* the must-match term count is not incremented. Thus, given a synonym of {code} monkeyhouse => monkey house {code} the query {{q=big+monkeyhouse&mm=100%}} becomes (effectively) {{q=big+monkey+house&mm=2}}. This query finds documents matching only two out of three terms ({{big+monkey}}, {{monkey+house}}, {{big+house}}). This might also be a problem in dismax.
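The arithmetic behind the bug above can be shown with a toy sketch (hypothetical code, not edismax internals): mm=100% is resolved against the raw pre-analysis term count, while the executed query contains the synonym-expanded clauses.

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration (hypothetical code, not edismax internals) of the
// mismatch described above: mm=100% is resolved against the raw term
// count, but the executed query contains the synonym-expanded clauses.
public class MmMismatch {
    /** mm=100% resolved against the pre-analysis query terms. */
    public static int resolveMm(List<String> rawTerms) {
        return rawTerms.size();
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("big", "monkeyhouse");
        // After the synonym filter (monkeyhouse => monkey house) runs:
        List<String> analyzed = Arrays.asList("big", "monkey", "house");

        int mm = resolveMm(raw);       // 2, from the 2 raw terms
        int clauses = analyzed.size(); // but 3 clauses are generated

        // Any 2 of {big, monkey, house} now satisfy mm, e.g. big+house,
        // even though the user asked for 100% of the terms to match.
        System.out.println("mm=" + mm + " of " + clauses + " clauses");
    }
}
```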
[jira] [Resolved] (SOLR-3646) /browse request handler fails in example if you don't specify a field in the query with no default specified via 'df' param
[ https://issues.apache.org/jira/browse/SOLR-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erick Erickson resolved SOLR-3646. -- Resolution: Fixed 4x: 1363747 trunk: 1363751 /browse request handler fails in example if you don't specify a field in the query with no default specified via 'df' param - Key: SOLR-3646 URL: https://issues.apache.org/jira/browse/SOLR-3646 Project: Solr Issue Type: Bug Components: SearchComponents - other Affects Versions: 4.0, 5.0 Reporter: Erick Erickson Assignee: Erick Erickson Priority: Minor Fix For: 4.0, 5.0 Attachments: SOLR-3646.patch Original Estimate: 1h Remaining Estimate: 1h If you try using the stock /browse request handler and don't specify a field in the search, you get the following stack (partial): SEVERE: org.apache.solr.common.SolrException: no field name specified in query and no default specified via 'df' param at org.apache.solr.search.SolrQueryParser.checkNullField(SolrQueryParser.java:136) at org.apache.solr.search.SolrQueryParser.getFieldQuery(SolrQueryParser.java:154) at org.apache.lucene.queryparser.classic.QueryParserBase.handleBareTokenQuery(QueryParserBase.java:1063) at org.apache.lucene.queryparser.classic.QueryParser.Term(QueryParser.java:350) . . .
[jira] [Assigned] (LUCENE-4109) BooleanQueries are not parsed correctly with the flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-4109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned LUCENE-4109: --- Assignee: Robert Muir BooleanQueries are not parsed correctly with the flexible query parser -- Key: LUCENE-4109 URL: https://issues.apache.org/jira/browse/LUCENE-4109 Project: Lucene - Java Issue Type: Bug Components: modules/queryparser Affects Versions: 3.5, 3.6 Reporter: Daniel Truemper Assignee: Robert Muir Fix For: 4.0 Attachments: LUCENE-4109.patch, test-patch.txt Hi, I just found another bug in the flexible query parser (together with Robert Muir, yay!). The following query string works in the standard query parser: {noformat} (field:[1 TO *] AND field:[* TO 2]) AND field2:z {noformat} yields {noformat} +(+field:[1 TO *] +field:[* TO 2]) +field2:z {noformat} The flexible query parser though yields: {noformat} +(field:[1 TO *] field:[* TO 2]) +field2:z {noformat} Test patch is attached (from Robert actually). I don't know if it affects earlier versions than 3.5.
[jira] [Commented] (SOLR-3292) /browse example fails to load on 3x: no field name specified in query and no default specified via 'df' param
[ https://issues.apache.org/jira/browse/SOLR-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419089#comment-13419089 ] Erick Erickson commented on SOLR-3292: -- I just fixed this in 4.x and trunk, can we close this one? /browse example fails to load on 3x: no field name specified in query and no default specified via 'df' param --- Key: SOLR-3292 URL: https://issues.apache.org/jira/browse/SOLR-3292 Project: Solr Issue Type: Bug Reporter: Hoss Man Assignee: Hoss Man Priority: Blocker Fix For: 3.6, 4.0, 5.0 1) java -jar start.jar using solr example on 3x branch circa r1306629 2) load http://localhost:8983/solr/browse 3) browser error: 400 no field name specified in query and no default specified via 'df' param 4) error in logs... {noformat} INFO: [] webapp=/solr path=/browse params={} hits=0 status=400 QTime=3 Mar 28, 2012 4:05:59 PM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: no field name specified in query and no default specified via 'df' param at org.apache.solr.search.SolrQueryParser.checkNullField(SolrQueryParser.java:158) at org.apache.solr.search.SolrQueryParser.getFieldQuery(SolrQueryParser.java:174) at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1429) at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1317) at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1245) at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1234) at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:206) at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:79) at org.apache.solr.search.QParser.getQuery(QParser.java:143) at org.apache.solr.request.SimpleFacets.getFacetQueryCounts(SimpleFacets.java:233) at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:194) at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:72) at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:186) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376) {noformat}
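A common way to avoid the "no default specified via 'df' param" error above is to declare a default search field in the handler's defaults. A minimal sketch, assuming the schema has a catch-all field named `text` (the field name is an assumption for illustration):

```xml
<!-- solrconfig.xml: give /browse a default search field so queries
     without an explicit field do not trigger the 'df' error.
     The field name "text" is an assumption for illustration. -->
<requestHandler name="/browse" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="df">text</str>
  </lst>
</requestHandler>
```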
[jira] [Updated] (LUCENE-4109) BooleanQueries are not parsed correctly with the flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-4109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-4109: Attachment: LUCENE-4109.patch Patch looks good to me! I also added Daniel's test from Buzzwords. Thanks for fixing this, and adding additional tests! Once the 3.6 branch is open I'll get it in.
[jira] [Commented] (SOLR-3653) Support Smart Simplified Chinese in Solr - include clean-up bigramming filter
[ https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419096#comment-13419096 ] Robert Muir commented on SOLR-3653: --- {quote} The Smart Simplified Chinese toolkit in lucene/analysis/smartcn has no Solr factories {quote} Actually there are factories in contrib/analysis-extras. {quote} and includes a fixup class to handle the occasional mistake made by the Smart Chinese implementation. {quote} I am not sure on this: if someone wants to mix an n-gram technique with a word model, they can just use two fields? If they want to limit the n-gram field to only longer terms, they should use LengthFilter. Furthermore, I don't really understand the problem here. The word you are upset about (中华人民共和国) is in the smartcn dictionary. As I understand, this word basically means PRC. This is a single concept and makes sense as an indexing unit. Why do we care how long it is in characters? Support Smart Simplified Chinese in Solr - include clean-up bigramming filter - Key: SOLR-3653 URL: https://issues.apache.org/jira/browse/SOLR-3653 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Lance Norskog Attachments: SOLR-3653.patch, SmartChineseType.pdf The Smart Simplified Chinese toolkit in lucene/analysis/smartcn has no Solr factories. Also, since it is a statistical algorithm, it is not perfect. This patch supplies factories and a schema.xml type for the existing Lucene Smart Chinese implementation, and includes a fixup class to handle the occasional mistake made by the Smart Chinese implementation. -- This message is automatically generated by JIRA. 
[jira] [Commented] (SOLR-3654) Add some tests using Tomcat as servlet container
[ https://issues.apache.org/jira/browse/SOLR-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419097#comment-13419097 ] Mark Miller commented on SOLR-3654: --- I'm 100% against this.
[jira] [Updated] (LUCENE-4224) Simplify MultiValuedCase in TermsIncludingScoreQuery
[ https://issues.apache.org/jira/browse/LUCENE-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martijn van Groningen updated LUCENE-4224: -- Attachment: LUCENE-4224.patch Attached a new patch. * Added a Scorer that scores in order. * The existing scorer now throws a UOE in the advance() method. Simplify MultiValuedCase in TermsIncludingScoreQuery Key: LUCENE-4224 URL: https://issues.apache.org/jira/browse/LUCENE-4224 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Assignee: Martijn van Groningen Attachments: LUCENE-4224.patch, LUCENE-4224.patch While looking at LUCENE-4214, i was trying to wrap my head around what this is doing... I think the code specialization in the multivalued scorer doesn't buy us any additional speed? At least according to my benchmarks?
[jira] [Commented] (SOLR-1781) Replication index directories not always cleaned up
[ https://issues.apache.org/jira/browse/SOLR-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419099#comment-13419099 ] Mark Miller commented on SOLR-1781: --- No, no reload. Can you please elaborate on what not going online means? Can you share logs? Replication index directories not always cleaned up --- Key: SOLR-1781 URL: https://issues.apache.org/jira/browse/SOLR-1781 Project: Solr Issue Type: Bug Components: replication (java), SolrCloud Affects Versions: 1.4 Environment: Windows Server 2003 R2, Java 6b18 Reporter: Terje Sten Bjerkseth Assignee: Mark Miller Fix For: 4.0, 5.0 Attachments: 0001-Replication-does-not-always-clean-up-old-directories.patch, SOLR-1781.patch, SOLR-1781.patch We had the same problem as someone described in http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201001.mbox/%3c222a518d-ddf5-4fc8-a02a-74d4f232b...@snooth.com%3e. A partial copy of that message: We're using the new replication and it's working pretty well. There's one detail I'd like to get some more information about. As the replication works, it creates versions of the index in the data directory. Originally we had index/, but now there are dated versions such as index.20100127044500/, which are the replicated versions. Each copy is sized in the vicinity of 65G. With our current hard drive it's fine to have two around, but 3 gets a little dicey. Sometimes we're finding that the replication doesn't always clean up after itself. I would like to understand this better, or to not have this happen. It could be a configuration issue.
[jira] [Commented] (SOLR-2366) Facet Range Gaps
[ https://issues.apache.org/jira/browse/SOLR-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419105#comment-13419105 ] Jan Høydahl commented on SOLR-2366: --- Mandar, since this issue is Unresolved, the feature is not part of any version (yet); there are only patches attached, which may not apply cleanly if they are old. Facet Range Gaps Key: SOLR-2366 URL: https://issues.apache.org/jira/browse/SOLR-2366 Project: Solr Issue Type: Improvement Reporter: Grant Ingersoll Priority: Minor Fix For: 4.1 Attachments: SOLR-2366.patch, SOLR-2366.patch There really is no reason why the range gap for date and numeric faceting needs to be evenly spaced. For instance, if and when SOLR-1581 is completed and one were doing spatial distance calculations, one could facet by function into 3 different sized buckets: walking distance (0-5KM), driving distance (5KM-150KM) and everything else (150KM+), for instance. We should be able to quantize the results into arbitrarily sized buckets. (Original syntax proposal removed, see discussion for concrete syntax)
[jira] [Commented] (LUCENE-3151) Make all of Analysis completely independent from Lucene Core
[ https://issues.apache.org/jira/browse/LUCENE-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419106#comment-13419106 ] Robert Muir commented on LUCENE-3151: - Hey Grant: I know it sounds silly but can we split out the getOffsetGap API change into a separate issue? This would be nice to fix ASAP. I don't understand why it takes IndexableField or took Fieldable. All the other methods here like getPositionIncrementGap take a String fieldName. I think this one should too. I don't think it needs a boolean for tokenized either: returning 0 for NOT_ANALYZED fields. If you choose NOT_ANALYZED, that should mean the Analyzer is not invoked! If you want to do expert stuff to control the offset gaps between values for NOT_ANALYZED fields, then just analyze it instead, with a keyword tokenizer! Make all of Analysis completely independent from Lucene Core Key: LUCENE-3151 URL: https://issues.apache.org/jira/browse/LUCENE-3151 Project: Lucene - Java Issue Type: Improvement Affects Versions: 4.0-ALPHA Reporter: Grant Ingersoll Fix For: 4.1 Attachments: LUCENE-3151.patch, LUCENE-3151.patch Lucene's analysis package, including the definitions of Attribute, TokenStream, etc. are quite useful outside of Lucene (for instance, Mahout uses them) for text processing. I'd like to move the definitions, or at least their packaging, to a separate JAR file so that one can consume them w/o needing Lucene core. My draft idea is to have a definition area that Lucene core is dependent on and the rest of the analysis package can then be dependent on the definition area. (I'm open to other ideas as well)
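The simplification proposed above can be sketched as a hypothetical Analyzer-like class (class name and default return values are assumptions for illustration, not the committed Lucene API): getOffsetGap keyed by field name only, mirroring getPositionIncrementGap, with no `tokenized` boolean because NOT_ANALYZED fields would never invoke the Analyzer at all.

```java
// Hypothetical sketch of the API shape proposed above (not Lucene code):
// getOffsetGap takes only the field name, like getPositionIncrementGap,
// and needs no 'tokenized' boolean because NOT_ANALYZED fields would
// never reach the Analyzer in the first place.
public class SketchAnalyzer {
    /** Position gap between multiple values of the same field. */
    public int getPositionIncrementGap(String fieldName) {
        return 0;
    }

    /** Proposed: offset gap keyed by field name alone. */
    public int getOffsetGap(String fieldName) {
        return 1;
    }
}
```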
Re: Lucene on iOS
Hi, This mailing list is only for the main Java-based Lucene library. Please ask your question to S4LuceneLibrary directly, which seems to be a completely independent port. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 19 July 2012, at 16:14, Tobias Buchholz wrote: Hello, I'm Tobias Buchholz and a student at HTW in Berlin. For my bachelor thesis I'm trying to improve the search algorithm of an iOS app, which offers some magazines with a lot of articles. To do that I used the S4LuceneLibrary by Michael Papp (https://github.com/mikekppp/S4LuceneLibrary), which is an iOS equivalent to the full-featured text search engine library of Apache Lucene. The problem is that the search is now very inconsistent: searching for some words takes very long, while for others it does not. This is a list of words I searched for, and the time each search took:

Berlin (34 hits) - 2.8 seconds
Tag (29 hits) - 11.8 seconds
Haus (3 hits) - 7.1 seconds
Straße (28 hits) - 15.7 seconds
Raumfahrt (5 hits) - 13.8 seconds
Astronomie (9 hits) - 6 seconds

So the results are quite different, but I thought it should take about the same time for every search phrase. Do you have an idea why that is? Thanks in advance! Best Regards, Tobias Buchholz
[jira] [Commented] (SOLR-1781) Replication index directories not always cleaned up
[ https://issues.apache.org/jira/browse/SOLR-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419114#comment-13419114 ] Markus Jelsma commented on SOLR-1781: - The node will never respond to HTTP requests, all ZK connections time out, very high resource consumption. I'll try to provide a log snippet soon. I tried running today's build several times but one specific node refuses to `come online`. Another node did well and runs today's build. I cannot attach a file to a resolved issue. Send over mail?
[jira] [Reopened] (SOLR-1781) Replication index directories not always cleaned up
[ https://issues.apache.org/jira/browse/SOLR-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller reopened SOLR-1781: --- I'll reopen - email is fine as well.
[jira] [Commented] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM
[ https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419119#comment-13419119 ] Michael McCandless commented on LUCENE-4227: {quote} Would it really be that much slower if it was slightly more reasonable, e.g. storing freqs in packed ints (with huper-duper fast options) instead of wasting so much on them? {quote} Probably not that much slower? I think that's a good idea! But I think we can explore this after committing? There are other things we can try too (e.g. collapse the skip list into a shared int[]: I think this one may give a perf gain; collapse positions, etc.).
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
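The "growing binary search" skipping described above can be sketched as exponential (galloping) search over a sorted in-memory docID array: gallop forward in doubling steps until the target is bracketed, then binary-search the bracketed window. This is an illustrative stand-in, not Lucene's actual DirectPostingsFormat code; the class and method names below are made up.

```java
public final class GallopingSearch {

    /** Returns the index of the first docID >= target, searching from index 'from';
     *  returns docs.length when every remaining docID is smaller. */
    public static int advance(int[] docs, int from, int target) {
        int n = docs.length;
        if (from >= n) return n;
        if (docs[from] >= target) return from;
        // Gallop: double the step until we overshoot the target or run off the end.
        int step = 1;
        int lo = from;                 // invariant: docs[lo] < target
        int hi = from + step;
        while (hi < n && docs[hi] < target) {
            lo = hi;
            step <<= 1;
            hi += step;
        }
        if (hi >= n) hi = n;
        // Binary search for the first element >= target in (lo, hi].
        int left = lo + 1, right = hi;
        while (left < right) {
            int mid = (left + right) >>> 1;
            if (docs[mid] < target) left = mid + 1;
            else right = mid;
        }
        return left;
    }

    public static void main(String[] args) {
        int[] docs = {2, 3, 5, 8, 13, 21, 34, 55};
        System.out.println(advance(docs, 0, 13)); // 4
        System.out.println(advance(docs, 0, 14)); // 5
        System.out.println(advance(docs, 0, 99)); // 8 (postings exhausted)
    }
}
```

The gallop keeps skips cheap when the target is nearby (the common case when two postings lists leapfrog each other) while still bounding the worst case by O(log n).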
[jira] [Commented] (LUCENE-4109) BooleanQueries are not parsed correctly with the flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-4109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419122#comment-13419122 ] Robert Muir commented on LUCENE-4109: - Hmm, with the patch some tests for TestMultiFieldQPHelper fail. I didn't look into it further, but we should figure out what's going on there if we can.
BooleanQueries are not parsed correctly with the flexible query parser -- Key: LUCENE-4109 URL: https://issues.apache.org/jira/browse/LUCENE-4109 Project: Lucene - Java Issue Type: Bug Components: modules/queryparser Affects Versions: 3.5, 3.6 Reporter: Daniel Truemper Assignee: Robert Muir Fix For: 4.0 Attachments: LUCENE-4109.patch, LUCENE-4109.patch, test-patch.txt
Hi, I just found another bug in the flexible query parser (together with Robert Muir, yay!). The following query string works in the standard query parser: {noformat} (field:[1 TO *] AND field:[* TO 2]) AND field2:z {noformat} yields {noformat} +(+field:[1 TO *] +field:[* TO 2]) +field2:z {noformat} The flexible query parser though yields: {noformat} +(field:[1 TO *] field:[* TO 2]) +field2:z {noformat} Test patch is attached (from Robert actually). I don't know if it affects earlier versions than 3.5.
[jira] [Commented] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM
[ https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419126#comment-13419126 ] Robert Muir commented on LUCENE-4227: - Yeah, i don't think we need to solve it before committing. I do think maybe this class needs some more warnings, to me it seems it will use crazy amounts of RAM. I also am not sure I like the name Direct... is it crazy to suggest Instantiated?
[jira] [Commented] (SOLR-1781) Replication index directories not always cleaned up
[ https://issues.apache.org/jira/browse/SOLR-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419127#comment-13419127 ] Markus Jelsma commented on SOLR-1781: - Log sent. This node has two shards on it and executed 2x 512 warmup queries, which adds up. It won't talk to ZK (tail of the log). Restarting the node with a build from the 18th works fine. Did it three times today. Thanks
Replication index directories not always cleaned up --- Key: SOLR-1781 URL: https://issues.apache.org/jira/browse/SOLR-1781 Project: Solr Issue Type: Bug Components: replication (java), SolrCloud Affects Versions: 1.4 Environment: Windows Server 2003 R2, Java 6b18 Reporter: Terje Sten Bjerkseth Assignee: Mark Miller Fix For: 4.0, 5.0 Attachments: 0001-Replication-does-not-always-clean-up-old-directories.patch, SOLR-1781.patch, SOLR-1781.patch
We had the same problem as someone described in http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201001.mbox/%3c222a518d-ddf5-4fc8-a02a-74d4f232b...@snooth.com%3e. A partial copy of that message: We're using the new replication and it's working pretty well. There's one detail I'd like to get some more information about. As the replication works, it creates versions of the index in the data directory. Originally we had index/, but now there are dated versions such as index.20100127044500/, which are the replicated versions. Each copy is sized in the vicinity of 65G. With our current hard drive it's fine to have two around, but 3 gets a little dicey. Sometimes we're finding that the replication doesn't always clean up after itself. I would like to understand this better, or to not have this happen. It could be a configuration issue.
Re: Lucene on iOS
Yes okay, I was hoping there could be a similar problem in the Java Lucene library with inconsistent search times, so the solution for that could help me as well. On Fri, Jul 20, 2012 at 3:03 PM, Jan Høydahl jan@cominvent.com wrote: Hi, This mailing list is only for the main Java based Lucene library. Please ask your question to S4LuceneLibrary directly, which seems to be a completely independent port. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 19. juli 2012, at 16:14, Tobias Buchholz wrote: Hello, I'm Tobias Buchholz, a student at HTW in Berlin. For my bachelor thesis I'm trying to improve the search algorithm of an iOS app that offers some magazines with a lot of articles. To do that I used the S4LuceneLibrary by Micheal Papp (https://github.com/mikekppp/S4LuceneLibrary), which is an iOS equivalent of the full-featured text search engine library Apache Lucene. The problem is that the search is now very inconsistent: searching for some specific words takes very long, while for others it does not. This is a list of words I searched for, and the time each search took:
- Berlin (34 hits) - 2.8 seconds
- Tag (29 hits) - 11.8 seconds
- Haus (3 hits) - 7.1 seconds
- Straße (28 hits) - 15.7 seconds
- Raumfahrt (5 hits) - 13.8 seconds
- Astronomie (9 hits) - 6 seconds
So the results are quite different, but I thought it should take about the same time for every search phrase. Do you have an idea why that is? Thanks in advance! Best Regards, Tobias Buchholz
[jira] [Commented] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM
[ https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419129#comment-13419129 ] Michael McCandless commented on LUCENE-4227: bq. I do think maybe this class needs some more warnings, to me it seems it will use crazy amounts of RAM. I'll add some scary warnings :) bq. I also am not sure I like the name Direct... is it crazy to suggest Instantiated? It is very much like the old Instantiated (though I think its terms dict is faster than Instantiated's)... but I didn't really like the name Instantiated... I had picked Direct because it directly represents the postings ... but maybe we can find a better name. I will update MIGRATE.txt to explain how Direct (or whatever we name it) is the closest match if you were previously using Instantiated...
[jira] [Commented] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM
[ https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419131#comment-13419131 ] Robert Muir commented on LUCENE-4227: - {quote} It is very much like the old Instantiated (though I think its terms dict is faster than Instantiated's)... but I didn't really like the name Instantiated... I had picked Direct because it directly represents the postings ... but maybe we can find a better name. {quote} OK, I think what would be better is a good synonym for Uncompressed. I realized Direct is consistent with packed ints or whatever... but I don't think it should use this name either; it's not intuitive.
[jira] [Comment Edited] (LUCENE-4224) Simplify MultiValuedCase in TermsIncludingScoreQuery
[ https://issues.apache.org/jira/browse/LUCENE-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419101#comment-13419101 ] Martijn van Groningen edited comment on LUCENE-4224 at 7/20/12 1:46 PM: Attached a new patch. * Added a Scorer that scores in order. * Existing scorer throws an UOE in the advance() method. was (Author: martijn.v.groningen): Attached a new patch. * Added a Scorer that scores in order. * Existing throw a UOE in the advance() method.
Simplify MultiValuedCase in TermsIncludingScoreQuery Key: LUCENE-4224 URL: https://issues.apache.org/jira/browse/LUCENE-4224 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Assignee: Martijn van Groningen Attachments: LUCENE-4224.patch, LUCENE-4224.patch
While looking at LUCENE-4214, i was trying to wrap my head around what this is doing... I think the code specialization in the multivalued scorer doesn't buy us any additional speed? At least according to my benchmarks?
[jira] [Updated] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-3892: -- Attachment: LUCENE-3892-blockFor-with-packedints.patch An initial try with PackedInts in the current trunk version. I replaced all the int[] buffers with long[] buffers so we can use the API directly. I don't quite understand the Writer part, so we have to save each long value one by one. However, it is the Reader part we are concerned with:
{format}
Task            QPS base  StdDev base  QPS packed  StdDev packed     Pct diff
AndHighHigh        29.60         1.56       23.78           0.51  -25% - -13%
AndHighMed         74.68         3.92       53.15           2.31  -35% - -21%
Fuzzy1             88.23         1.21       87.13           1.41   -4% -   1%
Fuzzy2             30.09         0.45       29.47           0.47   -5% -   1%
IntNRQ             41.96         3.88       38.16           2.48  -22% -   6%
OrHighHigh         17.56         0.34       15.45           0.15  -14% -  -9%
OrHighMed          34.71         0.76       30.77           0.53  -14% -  -7%
PKLookup          111.00         1.90      110.52           1.59   -3% -   2%
Phrase              9.03         0.23        7.62           0.41  -22% -  -8%
Prefix3           123.56         8.42      110.94           5.43  -20% -   1%
Respell           102.37         1.11      101.79           1.38   -2% -   1%
SloppyPhrase        3.97         0.19        3.52           0.07  -17% -  -4%
SpanNear            8.24         0.18        7.22           0.25  -17% -  -7%
Term               45.16         3.15       37.47           2.32  -27% -  -5%
TermBGroup1M       17.19         1.09       15.86           0.77  -17% -   3%
TermBGroup1M1P     23.47         1.66       20.43           1.16  -23% -  -1%
TermGroup1M        19.20         1.14       17.73           0.84  -16% -   2%
Wildcard           42.75         3.27       36.75           1.96  -24% -  -1%
{format}
Maybe we should try PACKED_SINGLE_BLOCK for some special values of numBits, instead of using PACKED all the time? Thanks to Adrien, we have a more direct API in LUCENE-4239; I'm trying that now.
Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.) - Key: LUCENE-3892 URL: https://issues.apache.org/jira/browse/LUCENE-3892 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Labels: gsoc2012, lucene-gsoc-12 Fix For: 4.1 Attachments: LUCENE-3892-BlockTermScorer.patch, LUCENE-3892-blockFor-with-packedints.patch, LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor.patch, LUCENE-3892-handle_open_files.patch, LUCENE-3892-pfor-compress-iterate-numbits.patch, LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for.patch, LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
On the flex branch we explored a number of possible intblock encodings, but for whatever reason never brought them to completion. There are still a number of issues opened with patches in different states. Initial results (based on prototype) were excellent (see http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html ). I think this would make a good GSoC project.
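The PACKED_SINGLE_BLOCK idea raised above (pick a layout where no value straddles a 64-bit block boundary, so reads need no cross-word stitching) can be sketched as follows. This is an illustrative stand-in, not the real PackedInts API; the class name is made up, and it assumes 1 <= bitsPerValue <= 32.

```java
// Sketch of single-block packing: each 64-bit long holds
// floor(64 / bitsPerValue) values, wasting the leftover high bits
// in exchange for simpler, branch-free reads.
public final class SingleBlockPacked {
    private final long[] blocks;
    private final int bitsPerValue;
    private final int valuesPerBlock;

    public SingleBlockPacked(int valueCount, int bitsPerValue) {
        this.bitsPerValue = bitsPerValue;
        this.valuesPerBlock = 64 / bitsPerValue;
        this.blocks = new long[(valueCount + valuesPerBlock - 1) / valuesPerBlock];
    }

    /** Stores the low bitsPerValue bits of value at the given index. */
    public void set(int index, long value) {
        int block = index / valuesPerBlock;
        int shift = (index % valuesPerBlock) * bitsPerValue;
        long mask = (1L << bitsPerValue) - 1;
        blocks[block] = (blocks[block] & ~(mask << shift)) | ((value & mask) << shift);
    }

    public long get(int index) {
        int block = index / valuesPerBlock;
        int shift = (index % valuesPerBlock) * bitsPerValue;
        return (blocks[block] >>> shift) & ((1L << bitsPerValue) - 1);
    }

    public static void main(String[] args) {
        SingleBlockPacked p = new SingleBlockPacked(100, 7); // 9 values per long
        for (int i = 0; i < 100; i++) p.set(i, i % 128);
        System.out.println(p.get(42)); // 42
    }
}
```

The trade-off versus a fully packed layout is a little wasted space (64 mod bitsPerValue bits per block) for cheaper decoding, which is why trying it "for some special values of numBits" is attractive.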
[jira] [Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419144#comment-13419144 ] Han Jiang edited comment on LUCENE-3892 at 7/20/12 1:52 PM: An initial try with PackedInts in the current trunk version. I replaced all the int[] buffers with long[] buffers so we can use the API directly. I don't quite understand the Writer part, so we have to save each long value one by one. However, it is the Reader part we are concerned with:
{noformat}
Task            QPS base  StdDev base  QPS packed  StdDev packed     Pct diff
AndHighHigh        29.60         1.56       23.78           0.51  -25% - -13%
AndHighMed         74.68         3.92       53.15           2.31  -35% - -21%
Fuzzy1             88.23         1.21       87.13           1.41   -4% -   1%
Fuzzy2             30.09         0.45       29.47           0.47   -5% -   1%
IntNRQ             41.96         3.88       38.16           2.48  -22% -   6%
OrHighHigh         17.56         0.34       15.45           0.15  -14% -  -9%
OrHighMed          34.71         0.76       30.77           0.53  -14% -  -7%
PKLookup          111.00         1.90      110.52           1.59   -3% -   2%
Phrase              9.03         0.23        7.62           0.41  -22% -  -8%
Prefix3           123.56         8.42      110.94           5.43  -20% -   1%
Respell           102.37         1.11      101.79           1.38   -2% -   1%
SloppyPhrase        3.97         0.19        3.52           0.07  -17% -  -4%
SpanNear            8.24         0.18        7.22           0.25  -17% -  -7%
Term               45.16         3.15       37.47           2.32  -27% -  -5%
TermBGroup1M       17.19         1.09       15.86           0.77  -17% -   3%
TermBGroup1M1P     23.47         1.66       20.43           1.16  -23% -  -1%
TermGroup1M        19.20         1.14       17.73           0.84  -16% -   2%
Wildcard           42.75         3.27       36.75           1.96  -24% -  -1%
{noformat}
Maybe we should try PACKED_SINGLE_BLOCK for some special values of numBits, instead of using PACKED all the time? Thanks to Adrien, we have a more direct API in LUCENE-4239; I'm trying that now.
[jira] [Created] (LUCENE-4240) Analyzer.getOffsetGap Improvements
Grant Ingersoll created LUCENE-4240: --- Summary: Analyzer.getOffsetGap Improvements Key: LUCENE-4240 URL: https://issues.apache.org/jira/browse/LUCENE-4240 Project: Lucene - Java Issue Type: Improvement Reporter: Grant Ingersoll From LUCENE-3151 (Robert Muir's comments): there is no need for the Analyzer to take in an IndexableField object. We can simplify this API: {quote} Hey Grant: I know it sounds silly but can we split out the getOffsetGap API change into a separate issue? This would be nice to fix ASAP. I don't understand why it takes IndexableField or took Fieldable. All the other methods here like getPositionIncrementGap take String fieldName. I think this one should too. I don't think it needs a boolean for tokenized either: returning 0 for NOT_ANALYZED fields. If you choose NOT_ANALYZED, that should mean the Analyzer is not invoked! If you want to do expert stuff to control the offset gaps between values for NOT_ANALYZED fields, then just analyze it instead, with keyword tokenizer! {quote}
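The signature change proposed above can be sketched as a hypothetical before/after; the FieldLike stand-in and the gap values below are made up for illustration and are not Lucene's real IndexableField or Analyzer code.

```java
public class OffsetGapSketch {
    // Hypothetical stand-in for the field object the old signature depended on.
    interface FieldLike { String name(); boolean tokenized(); }

    // Old shape: the gap depends on a whole field object and its tokenized flag.
    static int getOffsetGapOld(FieldLike field) {
        return field.tokenized() ? 1 : 0;
    }

    // Proposed shape: only the field name, consistent with
    // getPositionIncrementGap(String fieldName). No tokenized boolean is needed,
    // because a NOT_ANALYZED field never invokes the Analyzer at all.
    static int getOffsetGap(String fieldName) {
        return 1;
    }

    public static void main(String[] args) {
        System.out.println(getOffsetGap("body")); // 1
    }
}
```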
[jira] [Commented] (LUCENE-3151) Make all of Analysis completely independent from Lucene Core
[ https://issues.apache.org/jira/browse/LUCENE-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419163#comment-13419163 ] Grant Ingersoll commented on LUCENE-3151: - LUCENE-4240
Make all of Analysis completely independent from Lucene Core Key: LUCENE-3151 URL: https://issues.apache.org/jira/browse/LUCENE-3151 Project: Lucene - Java Issue Type: Improvement Affects Versions: 4.0-ALPHA Reporter: Grant Ingersoll Fix For: 4.1 Attachments: LUCENE-3151.patch, LUCENE-3151.patch
Lucene's analysis package, including the definitions of Attribute, TokenStream, etc. are quite useful outside of Lucene (for instance, Mahout uses them) for text processing. I'd like to move the definitions, or at least their packaging, to a separate JAR file so that one can consume them w/o needing Lucene core. My draft idea is to have a definition area that Lucene core is dependent on and the rest of the analysis package can then be dependent on the definition area. (I'm open to other ideas as well)
[jira] [Commented] (LUCENE-4240) Analyzer.getOffsetGap Improvements
[ https://issues.apache.org/jira/browse/LUCENE-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419165#comment-13419165 ] Uwe Schindler commented on LUCENE-4240: --- +1, nice simplification. I was always wondering about this inconsistency. String field is enough.
[jira] [Updated] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM
[ https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-4227: --- Attachment: LUCENE-4227.patch New patch, adding scary warning & MIGRATE.txt entry, fixing javadoc errors, and adding lucene.experimental ... still haven't thought of another name yet ...
[jira] [Commented] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM
[ https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419173#comment-13419173 ] Robert Muir commented on LUCENE-4227: - I don't have a better name either. Let's just commit it with this one and think about it for later!
[jira] [Commented] (LUCENE-4238) NRTCachingDirectory has concurrency bug(s).
[ https://issues.apache.org/jira/browse/LUCENE-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419174#comment-13419174 ] Michael McCandless commented on LUCENE-4238: Hi Mark, which test/seed are you seeing this on?
NRTCachingDirectory has concurrency bug(s). --- Key: LUCENE-4238 URL: https://issues.apache.org/jira/browse/LUCENE-4238 Project: Lucene - Java Issue Type: Bug Components: core/store Reporter: Mark Miller Fix For: 4.0, 5.0
Re: top level ant test shouldn't validate?
That makes two of us. I'm gonna disable this as it just broke my IDE just now again and I'm over the edge! On Thu, Jul 19, 2012 at 8:56 PM, Mark Miller markrmil...@gmail.com wrote: Top level ant clean breaks my IDE too! I don't know the fine points of this conversation, but it's super painful and I never call top level ant clean anymore. I kept meaning to look into why it was killing me but never got to it. Sent from my iPhone On Jul 19, 2012, at 12:46 PM, Robert Muir rcm...@gmail.com wrote: +1, we have caged the rat, we should be able to have a simple precommit check. Also, top-level 'ant clean' shouldn't call clean-jars. This *totally messes up* my IDE just because I like to run tests from the command-line. On Thu, Jul 19, 2012 at 12:40 PM, Steven A Rowe sar...@syr.edu wrote: On 7/19/2012 at 12:35 PM, Michael McCandless wrote: Any objections to fixing top level ant test to simply run tests...? Maybe we can add a precommit target to run tests, validate, javadocs-lint, ... +1 Steve -- lucidimagination.com
[jira] [Resolved] (LUCENE-4227) DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM
[ https://issues.apache.org/jira/browse/LUCENE-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-4227. Resolution: Fixed Fix Version/s: 5.0 4.0 DirectPostingsFormat, storing postings as simple int[] in memory, if you have tons of RAM - Key: LUCENE-4227 URL: https://issues.apache.org/jira/browse/LUCENE-4227 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0, 5.0 Attachments: LUCENE-4227.patch, LUCENE-4227.patch, LUCENE-4227.patch This postings format just wraps Lucene40 (on disk) but then at search time it loads (up front) all terms postings into RAM. You'd use this if you have insane amounts of RAM and want the fastest possible search performance. The postings are not compressed: docIds, positions are stored as straight int[]s. The terms are stored as a skip list (array of byte[]), but I packed all terms together into a single long byte[]: I had started as actual separate byte[] per term but the added pointer deref and loss of locality was a lot (~2X) slower for terms-dict intensive queries like FuzzyQuery. Low frequency postings (docFreq = 32 by default) store all docs, pos and offsets into a single int[]. High frequency postings store docs as int[], freqs as int[], and positions as int[][] parallel arrays. For skipping I just do a growing binary search. I also made specialized DirectTermScorer and DirectExactPhraseScorer for the high freq case that just pull the int[] and iterate themselves. All tests pass. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
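The resolved issue above describes postings held in RAM as straight int[] arrays, with skipping done as a growing binary search. A minimal, self-contained sketch of that idea (this is an illustration only, not Lucene's actual DirectPostingsFormat; the class and method names here are invented):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Toy sketch: uncompressed in-RAM postings as plain int[] per term,
// with advance() implemented as a binary search over the docID array.
class DirectPostingsSketch {
    private final Map<String, int[]> postings = new HashMap<>();

    void add(String term, int... sortedDocIds) {
        postings.put(term, sortedDocIds);
    }

    // Returns the first docID >= target for the term, or -1 when exhausted;
    // this mirrors the "skipping as a binary search" idea from the issue.
    int advance(String term, int target) {
        int[] docs = postings.get(term);
        if (docs == null) return -1;
        int idx = Arrays.binarySearch(docs, target);
        if (idx < 0) idx = -idx - 1;  // decode insertion point
        return idx < docs.length ? docs[idx] : -1;
    }

    public static void main(String[] args) {
        DirectPostingsSketch p = new DirectPostingsSketch();
        p.add("lucene", 2, 5, 9, 40);
        if (p.advance("lucene", 6) != 9) throw new AssertionError();
        if (p.advance("lucene", 40) != 40) throw new AssertionError();
        if (p.advance("lucene", 41) != -1) throw new AssertionError();
        System.out.println("ok");
    }
}
```

The trade-off named in the issue applies here too: no compression, so memory cost is high, but lookups touch simple arrays with good locality.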
[jira] [Updated] (LUCENE-4240) Analyzer.getOffsetGap Improvements
[ https://issues.apache.org/jira/browse/LUCENE-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-4240: Attachment: LUCENE-4240.patch initial patch Analyzer.getOffsetGap Improvements -- Key: LUCENE-4240 URL: https://issues.apache.org/jira/browse/LUCENE-4240 Project: Lucene - Java Issue Type: Improvement Reporter: Grant Ingersoll Attachments: LUCENE-4240.patch From LUCENE-3151 (Robert Muir's comments): there is no need for the Analyzer to take in an IndexableField object. We can simplify this API: {quote} Hey Grant: I know it sounds silly but can we split out the getOffsetGap API change into a separate issue? This would be nice to fix ASAP. I dont understand why it takes IndexableField or took Fieldable. All the other methods here like getPositionIncrementGap take String fieldName. I think this one should too. I dont think it needs a boolean for tokenized either: returning a 0 for NOT_ANALYZED fields. If you choose NOT_ANALYZED, that should mean the Analyzer is not invoked! If you want to do expert stuff control the offset gaps between values for NOT_ANALYZED fields, then just analyze it instead, with keyword tokenizer! {quote} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
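The proposal above is that getOffsetGap should take a String fieldName, like getPositionIncrementGap does. A toy illustration of what the gap means (invented class, not the actual Analyzer API): it pads the character offsets between consecutive values of a multi-valued field.

```java
// Illustrative only: shows the role of the offset gap when the values of one
// multi-valued field are laid out end to end. The proposed signature change is
// getOffsetGap(String fieldName) instead of a method taking an IndexableField.
class OffsetGapSketch {
    // Per-field gap inserted between consecutive values; 1 is just a demo value.
    int getOffsetGap(String fieldName) {
        return 1;
    }

    // Start offset of values[i] once all earlier values plus gaps are accounted for.
    int startOffset(String fieldName, String[] values, int i) {
        int offset = 0;
        for (int v = 0; v < i; v++) {
            offset += values[v].length() + getOffsetGap(fieldName);
        }
        return offset;
    }

    public static void main(String[] args) {
        OffsetGapSketch a = new OffsetGapSketch();
        String[] values = {"ab", "cde"};
        if (a.startOffset("body", values, 0) != 0) throw new AssertionError();
        // second value starts at length("ab") + gap = 3
        if (a.startOffset("body", values, 1) != 3) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Nothing in this computation needs the field object itself, which is the point of the simplification: the field name alone selects the gap.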
[jira] [Commented] (LUCENE-4240) Analyzer.getOffsetGap Improvements
[ https://issues.apache.org/jira/browse/LUCENE-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419186#comment-13419186 ] Michael McCandless commented on LUCENE-4240: +1 Analyzer.getOffsetGap Improvements -- Key: LUCENE-4240 URL: https://issues.apache.org/jira/browse/LUCENE-4240 Project: Lucene - Java Issue Type: Improvement Reporter: Grant Ingersoll Attachments: LUCENE-4240.patch From LUCENE-3151 (Robert Muir's comments): there is no need for the Analyzer to take in an IndexableField object. We can simplify this API: {quote} Hey Grant: I know it sounds silly but can we split out the getOffsetGap API change into a separate issue? This would be nice to fix ASAP. I dont understand why it takes IndexableField or took Fieldable. All the other methods here like getPositionIncrementGap take String fieldName. I think this one should too. I dont think it needs a boolean for tokenized either: returning a 0 for NOT_ANALYZED fields. If you choose NOT_ANALYZED, that should mean the Analyzer is not invoked! If you want to do expert stuff control the offset gaps between values for NOT_ANALYZED fields, then just analyze it instead, with keyword tokenizer! {quote} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4240) Analyzer.getOffsetGap Improvements
[ https://issues.apache.org/jira/browse/LUCENE-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419191#comment-13419191 ] Chris Male commented on LUCENE-4240: +1 Analyzer.getOffsetGap Improvements -- Key: LUCENE-4240 URL: https://issues.apache.org/jira/browse/LUCENE-4240 Project: Lucene - Java Issue Type: Improvement Reporter: Grant Ingersoll Attachments: LUCENE-4240.patch From LUCENE-3151 (Robert Muir's comments): there is no need for the Analyzer to take in an IndexableField object. We can simplify this API: {quote} Hey Grant: I know it sounds silly but can we split out the getOffsetGap API change into a separate issue? This would be nice to fix ASAP. I dont understand why it takes IndexableField or took Fieldable. All the other methods here like getPositionIncrementGap take String fieldName. I think this one should too. I dont think it needs a boolean for tokenized either: returning a 0 for NOT_ANALYZED fields. If you choose NOT_ANALYZED, that should mean the Analyzer is not invoked! If you want to do expert stuff control the offset gaps between values for NOT_ANALYZED fields, then just analyze it instead, with keyword tokenizer! {quote} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: 4069Failure.zip Attached a log of thread activity showing how TestIndexWriterCommit.testCommitThreadSafety() is failing. At this stage I can't tell if this is a failing in MockDirectoryWrapper or the test or the BloomPF class but it is related to files being removed unexpectedly. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0-ALPHA Reporter: Mark Harwood Priority: Minor Fix For: 4.0 Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. 
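The fast-fail idea behind LUCENE-4069 can be sketched with a toy Bloom filter (invented names; the real BloomPF lives in the patch, and a HashSet stands in here for the on-disk terms dictionary): a definite "not here" answer lets a term lookup return without touching the dictionary at all.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Toy segment-level Bloom filter: "no" answers are certain and skip the
// expensive terms-dictionary lookup; "maybe" answers fall through to it.
class BloomSketch {
    private final BitSet bits;
    private final int numBits;
    private final Set<String> termsDict = new HashSet<>();  // stand-in for on-disk terms

    BloomSketch(int numBits) {
        this.numBits = numBits;
        this.bits = new BitSet(numBits);
    }

    private int h1(String t) { return Math.floorMod(t.hashCode(), numBits); }
    private int h2(String t) { return Math.floorMod(t.hashCode() * 31 + 17, numBits); }

    void add(String term) {
        termsDict.add(term);
        bits.set(h1(term));
        bits.set(h2(term));
    }

    boolean mightContain(String term) {
        return bits.get(h1(term)) && bits.get(h2(term));
    }

    boolean seek(String term) {
        if (!mightContain(term)) return false;  // fast-fail: no dictionary access
        return termsDict.contains(term);        // expensive path, rarely a false positive
    }

    public static void main(String[] args) {
        BloomSketch b = new BloomSketch(1024);
        b.add("primarykey-0001");
        if (!b.seek("primarykey-0001")) throw new AssertionError();
        if (b.seek("primarykey-9999")) throw new AssertionError();
        System.out.println("ok");
    }
}
```

This is why the issue calls out low-frequency fields like primary keys on many-segment indexes: most per-segment lookups miss, and each miss becomes a couple of bit tests instead of a disk seek.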
[jira] [Comment Edited] (LUCENE-4109) BooleanQueries are not parsed correctly with the flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-4109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419042#comment-13419042 ] Karsten R. edited comment on LUCENE-4109 at 7/20/12 3:30 PM: - Patch for lucene/contrib against http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6 The patch adds the Processor BooleanQuery2ModifierNodeProcessor. The patch also changes ParametricRangeQueryNodeProcessor as hotfix for LUCENE-3338 (this change is not for 4.X because LUCENE-3338 is already fixed in 4.X). The patch passes most tests from QueryParserTestBase e.g. except {{{assertQueryEquals([\\* TO \*\],null,[\\* TO \\*]);}}} and LUCENE-2566 related tests. Patch for trunk will coming soon. was (Author: karsten-solr): Patch for lucene/contrib against http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6 The patch adds the Processor BooleanQuery2ModifierNodeProcessor. The patch also changes ParametricRangeQueryNodeProcessor as hotfix for LUCENE-3338 (this change is not for 4.X because LUCENE-3338 is already fixed in 4.X). The patch passes all tests from QueryParserTestBase except {{{assertQueryEquals([\\* TO \*\],null,[\\* TO \\*]);}}} and LUCENE-2566 related tests. Patch for trunk will coming soon. BooleanQueries are not parsed correctly with the flexible query parser -- Key: LUCENE-4109 URL: https://issues.apache.org/jira/browse/LUCENE-4109 Project: Lucene - Java Issue Type: Bug Components: modules/queryparser Affects Versions: 3.5, 3.6 Reporter: Daniel Truemper Assignee: Robert Muir Fix For: 4.0 Attachments: LUCENE-4109.patch, LUCENE-4109.patch, test-patch.txt Hi, I just found another bug in the flexible query parser (together with Robert Muir, yay!). 
The following query string works in the standard query parser: {noformat} (field:[1 TO *] AND field:[* TO 2]) AND field2:z {noformat} yields {noformat} +(+field:[1 TO *] +field:[* TO 2]) +field2:z {noformat} The flexible query parser though yields: {noformat} +(field:[1 TO *] field:[* TO 2]) +field2:z {noformat} Test patch is attached (from Robert actually). I don't know if it affects earlier versions than 3.5. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4240) Analyzer.getOffsetGap Improvements
[ https://issues.apache.org/jira/browse/LUCENE-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-4240. - Resolution: Fixed Fix Version/s: 5.0 4.0 Analyzer.getOffsetGap Improvements -- Key: LUCENE-4240 URL: https://issues.apache.org/jira/browse/LUCENE-4240 Project: Lucene - Java Issue Type: Improvement Reporter: Grant Ingersoll Fix For: 4.0, 5.0 Attachments: LUCENE-4240.patch From LUCENE-3151 (Robert Muir's comments): there is no need for the Analyzer to take in an IndexableField object. We can simplify this API: {quote} Hey Grant: I know it sounds silly but can we split out the getOffsetGap API change into a separate issue? This would be nice to fix ASAP. I dont understand why it takes IndexableField or took Fieldable. All the other methods here like getPositionIncrementGap take String fieldName. I think this one should too. I dont think it needs a boolean for tokenized either: returning a 0 for NOT_ANALYZED fields. If you choose NOT_ANALYZED, that should mean the Analyzer is not invoked! If you want to do expert stuff control the offset gaps between values for NOT_ANALYZED fields, then just analyze it instead, with keyword tokenizer! {quote} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4239) Provide access to PackedInts' low-level blocks - values conversion methods
[ https://issues.apache.org/jira/browse/LUCENE-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419283#comment-13419283 ] Han Jiang commented on LUCENE-4239: --- Thank you Adrien! We'll work easier with this Decoder/Encoder interface. However, This patch isn't passing ant-compile under latest trunk, seems that encoder/decoder methods for Packed64SingleBlockBulkOperation32 are missing? Anyway, we're not using docId up to 32 bits currently, I'll test the performance later. Since we have to handle IndexInput/Output at upper level, we prefer to use direct int[] rather than IntBuffer. Actually, we had a patch making PackedIntsDecompress handle int array instead: https://issues.apache.org/jira/secure/attachment/12532888/LUCENE-3892_for_int%5B%5D.patch (the file name was ForDecompressImpl.java). Performance test shows little difference between these two versions, but as int[] is clear and simple, I think that should be what we hope to use. So... maybe you can provide us methods like: encode(int[] values, long[] blocks, int iterations), decode(long[] blocks, int[] values, int iterations)? Provide access to PackedInts' low-level blocks - values conversion methods Key: LUCENE-4239 URL: https://issues.apache.org/jira/browse/LUCENE-4239 Project: Lucene - Java Issue Type: Improvement Components: core/other Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.0 Attachments: LUCENE-4239.patch In LUCENE-4161 we started to make the {{PackedInts}} API more flexible so that codecs could use it whenever they need to (un)pack integers. 
There are two posting formats in progress (For and PFor, LUCENE-3892) that perform a lot of integer (un)packing but the current API still has limits : - it only works with long[] arrays, whereas these codecs need to manipulate int[] arrays, - the packed reader iterators work great for unpacking long sequences of integers, but they would probably cause a lot of overhead to decode lots of short integer sequences such as the ones that can be generated by For and PFor. I've been looking at the For/PFor branch and it has a {{PackedIntsDecompress}} class (http://svn.apache.org/repos/asf/lucene/dev/branches/pforcodec_3892/lucene/core/src/java/org/apache/lucene/codecs/pfor/PackedIntsDecompress.java) which is very similar to {{oal.util.packed.BulkOperation}} (package-private), so maybe we should find a way to expose this class so that the For/PFor branch can directly use it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
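The method shapes requested in the comment above, roughly encode(int[] values, long[] blocks, ...) and decode(long[] blocks, int[] values, ...), amount to fixed-bit-width packing of ints into long blocks. A self-contained sketch of that conversion (illustrative only; signatures and class name are invented, not the real PackedInts/BulkOperation code):

```java
import java.util.Arrays;

// Toy fixed-width bit packing: 'bitsPerValue' bits per int, packed densely
// into long[] blocks, including values that straddle two longs.
class PackedSketch {
    static long[] encode(int[] values, int bitsPerValue) {
        long mask = (1L << bitsPerValue) - 1;
        long[] blocks = new long[(int) (((long) values.length * bitsPerValue + 63) / 64)];
        for (int i = 0; i < values.length; i++) {
            long v = values[i] & mask;
            long bitPos = (long) i * bitsPerValue;
            int block = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            blocks[block] |= v << shift;
            if (shift + bitsPerValue > 64) {           // value straddles two longs
                blocks[block + 1] |= v >>> (64 - shift);
            }
        }
        return blocks;
    }

    static int[] decode(long[] blocks, int bitsPerValue, int count) {
        long mask = (1L << bitsPerValue) - 1;
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            long bitPos = (long) i * bitsPerValue;
            int block = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = blocks[block] >>> shift;
            if (shift + bitsPerValue > 64) {           // pull the spilled high bits
                v |= blocks[block + 1] << (64 - shift);
            }
            values[i] = (int) (v & mask);
        }
        return values;
    }

    public static void main(String[] args) {
        int[] docDeltas = {3, 17, 31, 0, 9, 1023};
        int[] roundTrip = decode(encode(docDeltas, 10), 10, docDeltas.length);
        if (!Arrays.equals(docDeltas, roundTrip)) throw new AssertionError();
        System.out.println("ok");
    }
}
```

For/PFor decoders gain speed by hardcoding the loop body per bit width (no per-value shift arithmetic branching), which is what the generated BulkOperation classes do; the generic loop above only shows the data layout being discussed.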
[jira] [Commented] (LUCENE-4109) BooleanQueries are not parsed correctly with the flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-4109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419311#comment-13419311 ] Karsten R. commented on LUCENE-4109: Robert, I forgot to run all tests :-( The patch must also include MultiFieldQueryNodeProcessor ({{new OrQueryNode(children)}} instead of {{new BooleanQueryNode(children)}}) and PrecedenceQueryNodeProcessorPipeline ({{BooleanQuery2ModifierNodeProcessor.class}} instead of {{GroupQueryNodeProcessor.class}}). I will fix this on monday. btw. I hope {{((b:one +b:more) t:two)}} is equal to {{((b:one +b:more) (+t:two))}} BooleanQueries are not parsed correctly with the flexible query parser -- Key: LUCENE-4109 URL: https://issues.apache.org/jira/browse/LUCENE-4109 Project: Lucene - Java Issue Type: Bug Components: modules/queryparser Affects Versions: 3.5, 3.6 Reporter: Daniel Truemper Assignee: Robert Muir Fix For: 4.0 Attachments: LUCENE-4109.patch, LUCENE-4109.patch, test-patch.txt Hi, I just found another bug in the flexible query parser (together with Robert Muir, yay!). The following query string works in the standard query parser: {noformat} (field:[1 TO *] AND field:[* TO 2]) AND field2:z {noformat} yields {noformat} +(+field:[1 TO *] +field:[* TO 2]) +field2:z {noformat} The flexible query parser though yields: {noformat} +(field:[1 TO *] field:[* TO 2]) +field2:z {noformat} Test patch is attached (from Robert actually). I don't know if it affects earlier versions than 3.5. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-3655) A starting replica can briefly appear active after Solr starts and before recovery begins.
Mark Miller created SOLR-3655: - Summary: A starting replica can briefly appear active after Solr starts and before recovery begins. Key: SOLR-3655 URL: https://issues.apache.org/jira/browse/SOLR-3655 Project: Solr Issue Type: Bug Components: SolrCloud Reporter: Mark Miller Assignee: Mark Miller Priority: Minor Fix For: 4.0, 5.0 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3653) Support Smart Simplified Chinese in Solr - include clean-up bigramming filter
[ https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419326#comment-13419326 ] Lance Norskog commented on SOLR-3653: - bq. Actually there are factories in contrib/analysis-extras. You're right, I was thinking of a previous project. bq. I am not sure on this: if someone wants to mix an n-gram technique with a word model, they can just use two fields? If they want to limit the n-gram field to only longer terms, they should use LengthFilter. Is this the design? {code} Word-based field: SmartChineseWordTokenFilter - LengthFilter accept 1-3 letters Bigram-based field: SmartChineseWordTokenFilter - LengthFilter accept 4 or longer - Chinese-only bigrams {code} This works if the user searches simple words, like on a consumer site. In the legal document site, people block-copy 60-word document titles and expect to find the matching title first on the list. This requires a phrase search where 0 variations in position gives the exact title. If the two classes of terms are in two different fields, will that work? I did not think parsers did. Also, this design needs to allow for mixed-language text: year numbers, English words. Are the existing Lucene filters flexible enough to do this? bq. The word you are upset about (中华人民共和国) is in the smartcn dictionary. As I understand, this word basically means PRC. This is a single concept and makes sense as an indexing unit. Why do we care how long it is in characters? Because parts of it are also words, which should be searchable. Here are two more failed words: 个人所得税 (personal/individual income tax) and 社会保险 (National Congress, political body). I can imagine Congress would be in the dictionary, but personal income tax? If you search for income tax: 所得税 you will not find personal income tax. This points up a flaw: the bigram trick will not find this trigram. How do you know what's in the dictionary? The files are in a .mem format.
I can't find a main program for them. Support Smart Simplified Chinese in Solr - include clean-up bigramming filter - Key: SOLR-3653 URL: https://issues.apache.org/jira/browse/SOLR-3653 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Lance Norskog Attachments: SOLR-3653.patch, SmartChineseType.pdf The Smart Simplified Chinese toolkit in lucene/analysis/smartcn has no Solr factories. Also, since it is a statistical algorithm, it is not perfect. This patch supplies factories and a schema.xml type for the existing Lucene Smart Chinese implementation, and includes a fixup class to handle the occasional mistake made by the Smart Chinese implementation. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
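The two-field design quoted in the comment above (short terms in one field, long terms in another) can be mimicked with a toy length filter, shown here as a plain function over a token list (invented names; the real chain would use LengthFilter on two copies of the analyzer output):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy length-based token routing: the same stream feeds a "short word" field
// (1-3 characters) and a "long word" field (4+ characters). Illustrative only.
class LengthSplitSketch {
    static List<String> keepByLength(List<String> tokens, int min, int max) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            // count characters (code points), not UTF-16 units
            int len = t.codePointCount(0, t.length());
            if (len >= min && len <= max) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("中华人民共和国", "税", "所得税");
        System.out.println(keepByLength(tokens, 1, 3));                  // short-word field
        System.out.println(keepByLength(tokens, 4, Integer.MAX_VALUE));  // long-word field
    }
}
```

The open question in the thread, whether a cross-field phrase query can stitch the two fields back together for exact-title matching, is not answered by this routing step itself.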
[jira] [Commented] (SOLR-3653) Support Smart Simplified Chinese in Solr - include clean-up bigramming filter
[ https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419330#comment-13419330 ] Robert Muir commented on SOLR-3653: --- {quote} Because parts of it are also words, which should be searchable. {quote} Says who? There is no real word boundaries in this language. If you want to start indexing individual characters, just use StandardTokenizer. None of your examples are failures of this tokenizer. This is what it has in its dictionary! Support Smart Simplified Chinese in Solr - include clean-up bigramming filter - Key: SOLR-3653 URL: https://issues.apache.org/jira/browse/SOLR-3653 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Lance Norskog Attachments: SOLR-3653.patch, SmartChineseType.pdf The Smart Simplified Chinese toolkit in lucene/analysis/smartcn has no Solr factories. Also, since it is a statistical algorithm, it is not perfect. This patch supplies factories and a schema.xml type for the existing Lucene Smart Chinese implementation, and includes a fixup class to handle the occasional mistake made by the Smart Chinese implementation. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-4239) Provide access to PackedInts' low-level blocks - values conversion methods
[ https://issues.apache.org/jira/browse/LUCENE-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419283#comment-13419283 ] Han Jiang edited comment on LUCENE-4239 at 7/20/12 4:49 PM: Thank you Adrien! We'll work easier with this Decoder/Encoder interface. However, This patch isn't passing ant-compile under latest trunk, seems that encoder/decoder methods for Packed64SingleBlockBulkOperation32 are missing? Anyway, we're not using docId up to 32 bits currently, I'll test the performance later. * We're still using IntBuffer just because IndexInput/Ouput don't provide a read/writeInts() method :). Since we still have to handle IndexInput/Output at upper level, we prefer to use direct int[] rather than IntBuffer now. Actually, we had a patch making PackedIntsDecompress handle int[] instead, you can have a glance at it: http://pastebin.com/euvtBD8P. Performance test show little difference between these two versions, and we should choose a clean simple impl right? * As for PFor, we may have to encode another small block of ints with packed format when blockSize128 and blockSize%32 != 0. Current impl will use numBits=8,16,32 to simplify decoder. However, we may consider to use other numBits in near future, I'm afraid this will be a bottleneck when decoder is not hardcoded. So... as a second shot, maybe you can provide us methods like: encode(int[] values, long[] blocks, int iterations), decode(long[] blocks, int[] values, int iterations)? was (Author: billy): Thank you Adrien! We'll work easier with this Decoder/Encoder interface. However, This patch isn't passing ant-compile under latest trunk, seems that encoder/decoder methods for Packed64SingleBlockBulkOperation32 are missing? Anyway, we're not using docId up to 32 bits currently, I'll test the performance later. Since we have to handle IndexInput/Output at upper level, we prefer to use direct int[] rather than IntBuffer. 
Actually, we had a patch making PackedIntsDecompress handle int array instead: https://issues.apache.org/jira/secure/attachment/12532888/LUCENE-3892_for_int%5B%5D.patch (the file name was ForDecompressImpl.java). Performance test shows little difference between these two versions, but as int[] is clear and simple, I think that should be what we hope to use. So... maybe you can provide us methods like: encode(int[] values, long[] blocks, int iterations), decode(long[] blocks, int[] values, int iterations)? Provide access to PackedInts' low-level blocks - values conversion methods Key: LUCENE-4239 URL: https://issues.apache.org/jira/browse/LUCENE-4239 Project: Lucene - Java Issue Type: Improvement Components: core/other Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.0 Attachments: LUCENE-4239.patch In LUCENE-4161 we started to make the {{PackedInts}} API more flexible so that codecs could use it whenever they need to (un)pack integers. There are two posting formats in progress (For and PFor, LUCENE-3892) that perform a lot of integer (un)packing but the current API still has limits : - it only works with long[] arrays, whereas these codecs need to manipulate int[] arrays, - the packed reader iterators work great for unpacking long sequences of integers, but they would probably cause a lot of overhead to decode lots of short integer sequences such as the ones that can be generated by For and PFor. I've been looking at the For/PFor branch and it has a {{PackedIntsDecompress}} class (http://svn.apache.org/repos/asf/lucene/dev/branches/pforcodec_3892/lucene/core/src/java/org/apache/lucene/codecs/pfor/PackedIntsDecompress.java) which is very similar to {{oal.util.packed.BulkOperation}} (package-private), so maybe we should find a way to expose this class so that the For/PFor branch can directly use it. -- This message is automatically generated by JIRA. 
[jira] [Resolved] (LUCENE-4237) add ant task to generate optionally ALL javadocs
[ https://issues.apache.org/jira/browse/LUCENE-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-4237. - Resolution: Fixed Fix Version/s: (was: 4.0-ALPHA) 5.0 4.0 add ant task to generate optionally ALL javadocs Key: LUCENE-4237 URL: https://issues.apache.org/jira/browse/LUCENE-4237 Project: Lucene - Java Issue Type: Improvement Components: general/javadocs Reporter: Bernd Fehling Priority: Minor Fix For: 4.0, 5.0 Attachments: LUCENE-4237.patch As of jira LUCENE-3977 the generation of javadocs has been cleaned up and is now set fix to 'noindex' to keep distributions small. An ant task should make this selectable to have the option for really building ALL javadocs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3655) A starting replica can briefly appear active after Solr starts and before recovery begins.
[ https://issues.apache.org/jira/browse/SOLR-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419353#comment-13419353 ] Mark Miller commented on SOLR-3655: --- Hmm.. it almost looks like I thought this mostly because of a UI bug - I think perhaps it's showing green for a moment when it should not. When I try and check the same thing through the ZK tree, it looks right. I did tighten things so that the leader for sure sees a down state before the replica registers its live node though. A starting replica can briefly appear active after Solr starts and before recovery begins. -- Key: SOLR-3655 URL: https://issues.apache.org/jira/browse/SOLR-3655 Project: Solr Issue Type: Bug Components: SolrCloud Reporter: Mark Miller Assignee: Mark Miller Priority: Minor Fix For: 4.0, 5.0
[jira] [Created] (SOLR-3656) use same data dir in core reload
Yonik Seeley created SOLR-3656: -- Summary: use same data dir in core reload Key: SOLR-3656 URL: https://issues.apache.org/jira/browse/SOLR-3656 Project: Solr Issue Type: Bug Reporter: Yonik Seeley Priority: Minor When a core reload is issued, we should use the same data dir. This causes problems for things like our test framework that reload the core and end up with the data dir in a different place. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: svn commit: r1363272 - /lucene/dev/trunk/lucene/test-framework/src/java/org/apache/lucene/util/LuceneTestCase.java
Hi Martijn: thanks for looking into this! I think I have a better fix for these: the problem is actually in the AssertingAtomicReaders that AssertingDirectoryReader wraps its subreaders with. So I added the invisible-cache-key hack there, and removed it completely from LuceneTestCase. I tested this with the hudson seeds that failed (at their appropriate revisions) and it seems to work fine. I also ran tests for queries/grouping/join with -Dnightly=true, -Dtests.multiplier=5, etc. a few times and it all works. I'd really like to have AssertingDirectoryReader being used again. If there are problems we can just back out the change. On Thu, Jul 19, 2012 at 5:48 AM, m...@apache.org wrote: Author: mvg Date: Thu Jul 19 09:48:04 2012 New Revision: 1363272 URL: http://svn.apache.org/viewvc?rev=1363272&view=rev Log: Fix of rare FC insanity during tests that have occurred in grouping joining tests. Modified: lucene/dev/trunk/lucene/test-framework/src/java/org/apache/lucene/util/LuceneTestCase.java URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/test-framework/src/java/org/apache/lucene/util/LuceneTestCase.java?rev=1363272&r1=1363271&r2=1363272&view=diff
==============================================================================
--- lucene/dev/trunk/lucene/test-framework/src/java/org/apache/lucene/util/LuceneTestCase.java (original)
+++ lucene/dev/trunk/lucene/test-framework/src/java/org/apache/lucene/util/LuceneTestCase.java Thu Jul 19 09:48:04 2012
@@ -1048,7 +1048,7 @@ public abstract class LuceneTestCase ext
       if (r instanceof AtomicReader) {
         r = new FCInvisibleMultiReader(new AssertingAtomicReader((AtomicReader)r));
       } else if (r instanceof DirectoryReader) {
-        r = new FCInvisibleMultiReader(new AssertingDirectoryReader((DirectoryReader)r));
+        r = new FCInvisibleMultiReader((DirectoryReader)r);
       }
       break;
     default:
--
lucidimagination.com
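The fix described above depends on the wrapper exposing the wrapped reader's core cache key, so FieldCache sanity checks see one key instead of two. The "invisible cache key" idea can be sketched in plain Java (the types below are hypothetical stand-ins, not the actual Lucene classes):

```java
// Toy illustration of the invisible-cache-key hack: a delegating
// wrapper forwards its cache key to the reader it wraps, so any cache
// keyed on that object treats wrapper and delegate as the same entry.
public class CacheKeyDemo {
  interface Reader {
    Object getCoreCacheKey();
  }

  static class BaseReader implements Reader {
    public Object getCoreCacheKey() {
      return this; // default: the reader instance itself is the key
    }
  }

  static class AssertingReaderWrapper implements Reader {
    private final Reader in;
    AssertingReaderWrapper(Reader in) { this.in = in; }
    public Object getCoreCacheKey() {
      return in.getCoreCacheKey(); // delegate: stay invisible to caches
    }
  }

  public static void main(String[] args) {
    Reader base = new BaseReader();
    Reader wrapped = new AssertingReaderWrapper(base);
    // the wrapper does not introduce a second cache key
    if (wrapped.getCoreCacheKey() != base.getCoreCacheKey()) {
      throw new AssertionError("wrapper leaked its own cache key");
    }
    System.out.println("same core cache key: true");
  }
}
```

Without the delegation, the cache would see two distinct keys for the same underlying segment data, which is exactly the FC "insanity" the tests flag.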
[jira] [Updated] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-3892: -- Attachment: LUCENE-3892-blockFor-with-packedints-decoder.patch Patch with the decoder interface, mentioned in LUCENE-4239. I'm afraid that the for loop of readLong() hurts the performance. Here is the comparison against last patch:
{noformat}
Task             QPS base  StdDev base  QPS comp  StdDev comp    Pct diff
AndHighHigh         21.89         0.64     22.14         0.43   -3% -  6%
AndHighMed          52.23         2.34     52.94         1.74   -6% -  9%
Fuzzy1              86.61         1.63     87.29         3.14   -4% -  6%
Fuzzy2              30.54         0.54     30.95         1.18   -4% -  7%
IntNRQ              38.00         1.23     38.14         1.04   -5% -  6%
OrHighHigh          16.37         0.21     16.68         0.79   -4% -  8%
OrHighMed           39.59         0.69     40.34         2.16   -5% -  9%
PKLookup           111.51         1.34    112.78         1.37   -1% -  3%
Phrase               4.54         0.12      4.52         0.13   -5% -  5%
Prefix3            107.85         2.51    109.13         2.10   -3% -  5%
Respell            123.21         2.18    125.15         5.01   -4% -  7%
SloppyPhrase         6.51         0.11      6.44         0.29   -7% -  5%
SpanNear             5.36         0.16      5.31         0.14   -6% -  4%
Term                42.49         1.66     44.10         1.86   -4% - 12%
TermBGroup1M        17.86         0.80     17.82         0.51   -7% -  7%
TermBGroup1M1P      21.08         0.55     21.10         0.62   -5% -  5%
TermGroup1M         19.57         0.82     19.57         0.64   -7% -  7%
Wildcard            43.99         1.21     44.80         1.10   -3% -  7%
{noformat}
Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
- Key: LUCENE-3892 URL: https://issues.apache.org/jira/browse/LUCENE-3892 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Labels: gsoc2012, lucene-gsoc-12 Fix For: 4.1 Attachments: LUCENE-3892-BlockTermScorer.patch, LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints.patch, LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor.patch, LUCENE-3892-handle_open_files.patch, LUCENE-3892-pfor-compress-iterate-numbits.patch, LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for.patch, LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch On the flex branch we explored a number of possible intblock encodings, but for whatever reason never brought them to completion. There are still a number of issues opened with patches in different states. Initial results (based on prototype) were excellent (see http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html ). I think this would make a good GSoC project.
[jira] [Created] (SOLR-3657) error message only refers to source field when problem parsing value for dest field of copyField
Hoss Man created SOLR-3657: -- Summary: error message only refers to source field when problem parsing value for dest field of copyField Key: SOLR-3657 URL: https://issues.apache.org/jira/browse/SOLR-3657 Project: Solr Issue Type: Bug Reporter: Hoss Man When a client submits a document with a value that is copyFielded into a dest field where the value is not suitable (ie: something that is not a number copied into a numeric field), the error message only refers to the original source field name, not the dest field name. Ideally it should mention both fields.
[jira] [Commented] (SOLR-3657) error message only refers to source field when problem parsing value for dest field of copyField
[ https://issues.apache.org/jira/browse/SOLR-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419371#comment-13419371 ] Hoss Man commented on SOLR-3657: Info from solr-user...
{noformat}
schema.xml:
<types>
  ...
  <fieldtype name="text_not_empty" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.TrimFilterFactory"/>
      <filter class="solr.LengthFilterFactory" min="1" max="20"/>
    </analyzer>
  </fieldtype>
</types>
<fields>
  ...
  <field name="estimated_hours" type="tfloat" indexed="true" stored="true" required="false"/>
  <field name="s_estimated_hours" type="text_not_empty" indexed="false" stored="false"/>
</fields>
<copyField source="s_estimated_hours" dest="estimated_hours"/>
...
WARNUNG: Error creating document : SolrInputDocument[{id=id(1.0)={2930}, s_estimated_hours=s_estimated_hours(1.0)={}}]
org.apache.solr.common.SolrException: ERROR: [doc=2930] Error adding field 's_estimated_hours'=''
  at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:333)
  at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
  at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
  at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:66)
  at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:293)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:723)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
  at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
Caused by: java.lang.NumberFormatException: empty String
  at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:992)
  at java.lang.Float.parseFloat(Float.java:422)
  at org.apache.solr.schema.TrieField.createField(TrieField.java:410)
  at org.apache.solr.schema.FieldType.createFields(FieldType.java:289)
  at org.apache.solr.schema.SchemaField.createFields(SchemaField.java:107)
  at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:312)
  ... 11 more
{noformat}
My response...
{quote}
I believe this is intentional, but I can understand how it might be confusing. I think the point here is that since the field submitted by the client was named s_estimated_hours, that's the field used in the error reported back to the client when something goes wrong with the copyField -- if the error message referred to estimated_hours, the client may not have any idea why/where that field came from. But I can certainly understand the confusion; I've opened SOLR-3657 to try and improve on this. Ideally the error message should make it clear that the value from the source field was copied to the dest field, which then encountered the error.
{quote}
error message only refers to source field when problem parsing value for dest field of copyField Key: SOLR-3657 URL: https://issues.apache.org/jira/browse/SOLR-3657 Project: Solr Issue Type: Bug Reporter: Hoss Man When a client submits a document with a value that is copyFielded into a dest field where the value is not suitable (ie: something that is not a number copied into a numeric field), the error message only refers to the original source field name, not the dest field name. Ideally it should mention both fields.
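The improvement Hoss describes - an error that names both ends of the copyField - could look something like this. The helper and its exact wording are hypothetical, not Solr's actual DocumentBuilder code:

```java
// Sketch of the message SOLR-3657 asks for: when a copyField dest
// rejects a value, report both the source field (which the client
// actually sent) and the dest field (where parsing failed).
public class CopyFieldErrorDemo {
  // Hypothetical helper; Solr's real error construction differs.
  static String copyFieldError(String sourceField, String destField,
                               String value, Throwable cause) {
    return "ERROR: value '" + value + "' from source field '" + sourceField
        + "' was copied to dest field '" + destField
        + "', which failed to parse it: " + cause;
  }

  public static void main(String[] args) {
    // The failing case from the solr-user report: an empty string
    // copied into a tfloat field.
    String msg = copyFieldError("s_estimated_hours", "estimated_hours", "",
        new NumberFormatException("empty String"));
    System.out.println(msg);
  }
}
```

A message in this shape answers both questions at once: which client-supplied field triggered the failure, and which schema field actually choked on the value.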
[jira] [Updated] (SOLR-3656) use same data dir in core reload
[ https://issues.apache.org/jira/browse/SOLR-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated SOLR-3656: --- Attachment: SOLR-3656.patch Simple patch that just passes the directory of the current core when creating the new core. All tests pass. use same data dir in core reload Key: SOLR-3656 URL: https://issues.apache.org/jira/browse/SOLR-3656 Project: Solr Issue Type: Bug Reporter: Yonik Seeley Priority: Minor Attachments: SOLR-3656.patch When a core reload is issued, we should use the same data dir. This causes problems for things like our test framework that reload the core and end up with the data dir in a different place.
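The idea in the patch - carry the running core's data directory over to its replacement instead of re-resolving it - can be sketched in isolation. The class and field names below are hypothetical, not Solr's real SolrCore API:

```java
// Standalone sketch of SOLR-3656's fix: on reload, hand the new core
// the directory the old core was already using, rather than resolving
// the data dir again (which may land somewhere else, as the test
// framework observed).
public class ReloadDemo {
  static class Core {
    final String name;
    final String dataDir;
    Core(String name, String dataDir) {
      this.name = name;
      this.dataDir = dataDir;
    }
    Core reload() {
      // pass the *current* data dir to the replacement instance
      return new Core(this.name, this.dataDir);
    }
  }

  public static void main(String[] args) {
    Core core = new Core("collection1", "/var/solr/collection1/data");
    Core reloaded = core.reload();
    if (!reloaded.dataDir.equals(core.dataDir)) {
      throw new AssertionError("reload moved the data dir");
    }
    System.out.println("reloaded core keeps data dir: " + reloaded.dataDir);
  }
}
```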
[jira] [Commented] (SOLR-1781) Replication index directories not always cleaned up
[ https://issues.apache.org/jira/browse/SOLR-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419380#comment-13419380 ] Mark Miller commented on SOLR-1781: --- Hmm...i can't replicate this issue so far. Another change around then was updating to ZooKeeper 3.3.5 (bug fix update). I wouldnt expect that to be an issue - but are you just upgrading one node and not all of them? Replication index directories not always cleaned up --- Key: SOLR-1781 URL: https://issues.apache.org/jira/browse/SOLR-1781 Project: Solr Issue Type: Bug Components: replication (java), SolrCloud Affects Versions: 1.4 Environment: Windows Server 2003 R2, Java 6b18 Reporter: Terje Sten Bjerkseth Assignee: Mark Miller Fix For: 4.0, 5.0 Attachments: 0001-Replication-does-not-always-clean-up-old-directories.patch, SOLR-1781.patch, SOLR-1781.patch We had the same problem as someone described in http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201001.mbox/%3c222a518d-ddf5-4fc8-a02a-74d4f232b...@snooth.com%3e. A partial copy of that message: We're using the new replication and it's working pretty well. There's one detail I'd like to get some more information about. As the replication works, it creates versions of the index in the data directory. Originally we had index/, but now there are dated versions such as index.20100127044500/, which are the replicated versions. Each copy is sized in the vicinity of 65G. With our current hard drive it's fine to have two around, but 3 gets a little dicey. Sometimes we're finding that the replication doesn't always clean up after itself. I would like to understand this better, or to not have this happen. It could be a configuration issue. -- This message is automatically generated by JIRA. 
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419381#comment-13419381 ] Robert Muir commented on LUCENE-3892: - {quote} I'm afraid that the for loop of readLong() hurts the performance. Here is the comparison against last patch: {quote} I think so too. I think in each enum, up front you want a pre-allocated byte[] (maximum size possible for the block), and you do ByteBuffer.wrap(x).asLongBuffer. after you read the header, call readBytes() and then just rewind()? So this is just like what you do now in the branch, except with LongBuffer instead of IntBuffer Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.) - Key: LUCENE-3892 URL: https://issues.apache.org/jira/browse/LUCENE-3892 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Labels: gsoc2012, lucene-gsoc-12 Fix For: 4.1 Attachments: LUCENE-3892-BlockTermScorer.patch, LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints.patch, LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor.patch, LUCENE-3892-handle_open_files.patch, LUCENE-3892-pfor-compress-iterate-numbits.patch, LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for.patch, LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch On the flex branch we explored a number of possible intblock encodings, but for whatever reason never brought them to completion. There are still a number of issues opened with patches in different states. 
Initial results (based on prototype) were excellent (see http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html ). I think this would make a good GSoC project.
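Robert's suggestion - read the whole block's bytes once into a pre-allocated byte[], then view it as longs instead of looping over readLong() - can be sketched with plain java.nio. This is a standalone illustration, not the branch's actual enum code; the block size and sample values are made up:

```java
import java.nio.ByteBuffer;
import java.nio.LongBuffer;

public class BulkLongDecode {
  // Pre-allocated once per enum: the maximum block size in bytes,
  // plus a reusable LongBuffer view over the same backing array
  // (the ByteBuffer.wrap(x).asLongBuffer() from the comment above).
  static final int MAX_BLOCK_LONGS = 128;
  static final byte[] scratch = new byte[MAX_BLOCK_LONGS * 8];
  static final LongBuffer longView = ByteBuffer.wrap(scratch).asLongBuffer();

  public static void main(String[] args) {
    // Pretend readBytes() just filled `scratch` from the index input;
    // here we fake it by writing four big-endian longs.
    ByteBuffer writer = ByteBuffer.wrap(scratch);
    for (int i = 0; i < 4; i++) {
      writer.putLong(i * 7L);
    }
    // Instead of a per-value loop of readLong() calls, rewind the
    // view and bulk-copy the longs out of the already-read bytes.
    longView.rewind();
    long[] block = new long[4];
    longView.get(block);
    for (int i = 0; i < 4; i++) {
      if (block[i] != i * 7L) {
        throw new AssertionError("decode mismatch at " + i);
      }
    }
    System.out.println("decoded " + block.length + " longs");
  }
}
```

The view and the scratch array are allocated once, so per-block decoding is one readBytes() plus a rewind() and a bulk get, with no per-long method-call overhead.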
[jira] [Commented] (SOLR-1781) Replication index directories not always cleaned up
[ https://issues.apache.org/jira/browse/SOLR-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419385#comment-13419385 ] Markus Jelsma commented on SOLR-1781: - Strange indeed. I can/could replicate it on one machine consistently and not on others. Machines weren't upgraded at the same time to prevent cluster downtime. I'll check back monday, there are two other machines left to upgrade plus the bad node. Replication index directories not always cleaned up --- Key: SOLR-1781 URL: https://issues.apache.org/jira/browse/SOLR-1781 Project: Solr Issue Type: Bug Components: replication (java), SolrCloud Affects Versions: 1.4 Environment: Windows Server 2003 R2, Java 6b18 Reporter: Terje Sten Bjerkseth Assignee: Mark Miller Fix For: 4.0, 5.0 Attachments: 0001-Replication-does-not-always-clean-up-old-directories.patch, SOLR-1781.patch, SOLR-1781.patch We had the same problem as someone described in http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201001.mbox/%3c222a518d-ddf5-4fc8-a02a-74d4f232b...@snooth.com%3e. A partial copy of that message: We're using the new replication and it's working pretty well. There's one detail I'd like to get some more information about. As the replication works, it creates versions of the index in the data directory. Originally we had index/, but now there are dated versions such as index.20100127044500/, which are the replicated versions. Each copy is sized in the vicinity of 65G. With our current hard drive it's fine to have two around, but 3 gets a little dicey. Sometimes we're finding that the replication doesn't always clean up after itself. I would like to understand this better, or to not have this happen. It could be a configuration issue. -- This message is automatically generated by JIRA. 
[jira] [Updated] (SOLR-2115) DataImportHandler config file *must* be specified in defaults or status will be DataImportHandler started. Not Initialized. No commands can be run
[ https://issues.apache.org/jira/browse/SOLR-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Dyer updated SOLR-2115: - Description: The DataImportHandler has two URL parameters for defining the data-config.xml file to be used for the command. 'config' is used in some places and 'dataConfig' is used in other places. 'config' does not work from an HTTP request. However, if it is in the defaults section of the DIH requestHandler definition, it works. If the 'config' parameter is used in an HTTP request, the DIH uses the default in the requestHandler anyway. This is the exception stack received by the client if there is no default. (This is the 3.X branch.)
<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/> <title>Error 500 </title> </head> <body><h2>HTTP ERROR: 500</h2><pre>null java.lang.NullPointerException
  at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:146)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
  ..etc..
was: The DataImportHandler has two URL parameters for defining the data-config.xml file to be used for the command. 'config' is used in some places and 'dataConfig' is used in other places. 'config' does not work from an HTTP request. However, if it is in the defaults section of the DIH requestHandler definition, it works. If the 'config' parameter is used in an HTTP request, the DIH uses the default in the requestHandler anyway. This is the exception stack received by the client if there is no default. (This is the 3.X branch.)
<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/> <title>Error 500 </title> </head> <body><h2>HTTP ERROR: 500</h2><pre>null java.lang.NullPointerException
  at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:146)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
  at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
  at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
  at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
  at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
  at org.mortbay.jetty.Server.handle(Server.java:285)
  at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
  at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
  at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
  at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
  at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
  at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
  at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
</pre> <p>RequestURI=/solr/db/dataimport</p><p><i><small><a href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p> </body> </html>
[jira] [Commented] (LUCENE-4239) Provide access to PackedInts' low-level blocks - values conversion methods
[ https://issues.apache.org/jira/browse/LUCENE-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419415#comment-13419415 ] Michael McCandless commented on LUCENE-4239: I think we should just commit this current patch onto the block PF branch (https://svn.apache.org/repos/asf/lucene/dev/branches/pforcodec_3892 )? Then we can iterate on it, from both ends... Provide access to PackedInts' low-level blocks - values conversion methods Key: LUCENE-4239 URL: https://issues.apache.org/jira/browse/LUCENE-4239 Project: Lucene - Java Issue Type: Improvement Components: core/other Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.0 Attachments: LUCENE-4239.patch In LUCENE-4161 we started to make the {{PackedInts}} API more flexible so that codecs could use it whenever they need to (un)pack integers. There are two posting formats in progress (For and PFor, LUCENE-3892) that perform a lot of integer (un)packing but the current API still has limits : - it only works with long[] arrays, whereas these codecs need to manipulate int[] arrays, - the packed reader iterators work great for unpacking long sequences of integers, but they would probably cause a lot of overhead to decode lots of short integer sequences such as the ones that can be generated by For and PFor. I've been looking at the For/PFor branch and it has a {{PackedIntsDecompress}} class (http://svn.apache.org/repos/asf/lucene/dev/branches/pforcodec_3892/lucene/core/src/java/org/apache/lucene/codecs/pfor/PackedIntsDecompress.java) which is very similar to {{oal.util.packed.BulkOperation}} (package-private), so maybe we should find a way to expose this class so that the For/PFor branch can directly use it. -- This message is automatically generated by JIRA. 
[jira] [Created] (LUCENE-4241) non-reproducible failures from RecoveryZkTest - mostly NRTCachingDirectory.deleteFile
Hoss Man created LUCENE-4241: Summary: non-reproducible failures from RecoveryZkTest - mostly NRTCachingDirectory.deleteFile Key: LUCENE-4241 URL: https://issues.apache.org/jira/browse/LUCENE-4241 Project: Lucene - Java Issue Type: Bug Reporter: Hoss Man Since getting my new laptop, I've noticed some sporadic failures from RecoveryZkTest, so last night I tried running 100 iterations against trunk (r1363555), and got 5 errors/failures... * 3 assertion failures from NRTCachingDirectory.deleteFile * 1 node recovery assertion from AbstractDistributedZkTestCase.waitForRecoveriesToFinish caused by OOM * 1 searcher leak assertion: opens=1658 closes=1652 (possibly lingering effects from OOM?) see comments/attachments for details
[jira] [Updated] (LUCENE-4241) non-reproducible failures from RecoveryZkTest - mostly NRTCachingDirectory.deleteFile
[ https://issues.apache.org/jira/browse/LUCENE-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated LUCENE-4241: - Attachment: just-failures.txt RecoveryZkTest.testDistribSearch-100-tests-failures.txt.tgz Full tests-failures.txt (compressed) and a summary file containing just the failure stack traces (no log output) non-reproducible failures from RecoveryZkTest - mostly NRTCachingDirectory.deleteFile - Key: LUCENE-4241 URL: https://issues.apache.org/jira/browse/LUCENE-4241 Project: Lucene - Java Issue Type: Bug Reporter: Hoss Man Attachments: RecoveryZkTest.testDistribSearch-100-tests-failures.txt.tgz, just-failures.txt Since getting my new laptop, I've noticed some sporadic failures from RecoveryZkTest, so last night I tried running 100 iterations against trunk (r1363555), and got 5 errors/failures... * 3 assertion failures from NRTCachingDirectory.deleteFile * 1 node recovery assertion from AbstractDistributedZkTestCase.waitForRecoveriesToFinish caused by OOM * 1 searcher leak assertion: opens=1658 closes=1652 (possibly lingering effects from OOM?) see comments/attachments for details
[jira] [Updated] (SOLR-2115) DataImportHandler config file *must* be specified in defaults or status will be DataImportHandler started. Not Initialized. No commands can be run
[ https://issues.apache.org/jira/browse/SOLR-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Dyer updated SOLR-2115: - Attachment: SOLR-2115.patch With this patch... - DIH attempts to reload the configuration every time a new import is started. This is slightly more overhead, but negligible compared with the time an import takes as a whole. - The config is not loaded on startup and there is no need to have a "defaults" section or have the config declared in solrconfig.xml at all. Instead, users have the option to specify the config file on the request with the config parameter. - The dataConfig parameter, which lets users include the entire configuration as a request parameter, is now always supported (previously this was only supported in debug mode). - The reload-config command is still supported, which is useful for validating a new configuration file, or if you want to specify a file, load it, and not have it reloaded again on import. - Datasources can still be specified in solrconfig.xml. As before, these must be specified in the defaults section of the handler in solrconfig.xml. However, these are not parsed until the main configuration is loaded. - If there is an XML mistake in the configuration, a much more user-friendly message is given in XML format, not raw format as before. Users can fix the problem and reload-config. DataImportHandler config file *must* be specified in defaults or status will be DataImportHandler started. Not Initialized. No commands can be run -- Key: SOLR-2115 URL: https://issues.apache.org/jira/browse/SOLR-2115 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Affects Versions: 1.4.1, 1.4.2, 3.1, 4.0-ALPHA Reporter: Lance Norskog Assignee: James Dyer Priority: Minor Fix For: 4.0 Attachments: SOLR-2115.patch The DataImportHandler has two URL parameters for defining the data-config.xml file to be used for the command. 'config' is used in some places and 'dataConfig' is used in other places. 'config' does not work from an HTTP request. However, if it is in the defaults section of the DIH requestHandler definition, it works. If the 'config' parameter is used in an HTTP request, the DIH uses the default in the requestHandler anyway. This is the exception stack received by the client if there is no default. (This is the 3.X branch.)
<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/> <title>Error 500 </title> </head> <body><h2>HTTP ERROR: 500</h2><pre>null java.lang.NullPointerException
  at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:146)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
  ..etc..
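For reference, the two ways of pointing DIH at its config file that the patch reconciles - a defaults entry in solrconfig.xml versus a per-request parameter - look roughly like this (the handler name and file path are illustrative):

```xml
<!-- Option 1: fix the config file in solrconfig.xml via "defaults" -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

<!-- Option 2 (with the patch applied): omit the default and pass the
     file per request, e.g.
     /dataimport?command=full-import&amp;config=data-config.xml -->
```

Before the patch, only Option 1 worked reliably; the config parameter on the HTTP request was silently ignored in favor of the default.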
[jira] [Commented] (SOLR-1781) Replication index directories not always cleaned up
[ https://issues.apache.org/jira/browse/SOLR-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419427#comment-13419427 ] Mark Miller commented on SOLR-1781: --- bq. Machines weren't upgraded at the same time to prevent cluster downtime. Yeah, makes sense, just wasn't sure how you went about it. I'd expect a bugfix release of zookeeper to work no problem with the previous nodes, but it's the other variable I think. They recommend upgrading with rolling restarts, so it shouldn't be the problem... Replication index directories not always cleaned up --- Key: SOLR-1781 URL: https://issues.apache.org/jira/browse/SOLR-1781 Project: Solr Issue Type: Bug Components: replication (java), SolrCloud Affects Versions: 1.4 Environment: Windows Server 2003 R2, Java 6b18 Reporter: Terje Sten Bjerkseth Assignee: Mark Miller Fix For: 4.0, 5.0 Attachments: 0001-Replication-does-not-always-clean-up-old-directories.patch, SOLR-1781.patch, SOLR-1781.patch We had the same problem as someone described in http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201001.mbox/%3c222a518d-ddf5-4fc8-a02a-74d4f232b...@snooth.com%3e. A partial copy of that message: We're using the new replication and it's working pretty well. There's one detail I'd like to get some more information about. As the replication works, it creates versions of the index in the data directory. Originally we had index/, but now there are dated versions such as index.20100127044500/, which are the replicated versions. Each copy is sized in the vicinity of 65G. With our current hard drive it's fine to have two around, but 3 gets a little dicey. Sometimes we're finding that the replication doesn't always clean up after itself. I would like to understand this better, or to not have this happen. It could be a configuration issue. -- This message is automatically generated by JIRA. 
[jira] [Commented] (SOLR-2482) DataImportHandler; reload-config; response in case of failure further requests
[ https://issues.apache.org/jira/browse/SOLR-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419428#comment-13419428 ] James Dyer commented on SOLR-2482: -- See SOLR-2115 for a patch that solves both issues.

DataImportHandler; reload-config; response in case of failure further requests Key: SOLR-2482 URL: https://issues.apache.org/jira/browse/SOLR-2482 Project: Solr Issue Type: Improvement Components: contrib - DataImportHandler, web gui Reporter: Stefan Matheis (steffkes) Priority: Minor Attachments: reload-config-error.html

Reloading while the config-file is valid is completely fine, but if the config is broken the response is plain HTML, containing the full stacktrace (see attachment). Further requests contain a {{status}} element with ??DataImportHandler started. Not Initialized. No commands can be run??, but respond with an HTTP status 200 OK :/ Would be nice if:
* the response in case of error could also be XML-formatted
* it contained the exception message (in my case ??The end-tag for element type entity must end with a '>' delimiter.??) in a separate field
* a better/correct HTTP status were used for the latter requests; I would suggest {{503 Service Unavailable}}
So we would be able to display the error message to the user when the config gets broken, and for the further requests we could rely on the HTTP status and have no need to check the content of the XML response.
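A sketch of the response shape the issue asks for: an XML body carrying the parse error plus HTTP 503 instead of an HTML stacktrace with 200 OK. The class and helper names are made up for illustration and are not actual Solr APIs.

```java
// Illustrative only: builds the kind of XML error payload SOLR-2482 requests,
// paired with a 503 status for requests made while the DIH config is broken.
public class DihErrorResponse {
    static String toXml(int status, String message) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
             + "<response>\n"
             + "  <int name=\"status\">" + status + "</int>\n"
             + "  <str name=\"error\">" + escape(message) + "</str>\n"
             + "</response>";
    }

    // minimal XML escaping so the exception text cannot break the document
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // 503 Service Unavailable, as suggested in the issue for follow-up requests
        System.out.println(toXml(503,
            "The end-tag for element type entity must end with a '>' delimiter."));
    }
}
```

With the status code carried out-of-band (and in the body), clients can react to the failure without parsing HTML.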
[jira] [Commented] (LUCENE-4241) non-reproducible failures from RecoveryZkTest - mostly NRTCachingDirectory.deleteFile
[ https://issues.apache.org/jira/browse/LUCENE-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419429#comment-13419429 ] Mark Miller commented on LUCENE-4241: - Probably related to https://issues.apache.org/jira/browse/LUCENE-4238. I didn't see it in that test, but I've seen it. I have a small test case that demonstrates one of the problems.

non-reproducible failures from RecoveryZkTest - mostly NRTCachingDirectory.deleteFile - Key: LUCENE-4241 URL: https://issues.apache.org/jira/browse/LUCENE-4241 Project: Lucene - Java Issue Type: Bug Reporter: Hoss Man Attachments: RecoveryZkTest.testDistribSearch-100-tests-failures.txt.tgz, just-failures.txt

Since getting my new laptop, I've noticed some sporadic failures from RecoveryZkTest, so last night I tried running 100 iterations against trunk (r1363555), and got 5 errors/failures...
* 3 assertion failures from NRTCachingDirectory.deleteFile
* 1 node recovery assertion from AbstractDistributedZkTestCase.waitForRecoveriesToFinish caused by OOM
* 1 searcher leak assertion: opens=1658 closes=1652 (possibly lingering effects from OOM?)
see comments/attachments for details
[jira] [Commented] (LUCENE-4239) Provide access to PackedInts' low-level blocks - values conversion methods
[ https://issues.apache.org/jira/browse/LUCENE-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419430#comment-13419430 ] Han Jiang commented on LUCENE-4239: --- bq. I think we should just commit this current patch onto the block PF branch ... +1, but shall we wait Adrien to fix the missing methods first? Provide access to PackedInts' low-level blocks - values conversion methods Key: LUCENE-4239 URL: https://issues.apache.org/jira/browse/LUCENE-4239 Project: Lucene - Java Issue Type: Improvement Components: core/other Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.0 Attachments: LUCENE-4239.patch In LUCENE-4161 we started to make the {{PackedInts}} API more flexible so that codecs could use it whenever they need to (un)pack integers. There are two posting formats in progress (For and PFor, LUCENE-3892) that perform a lot of integer (un)packing but the current API still has limits : - it only works with long[] arrays, whereas these codecs need to manipulate int[] arrays, - the packed reader iterators work great for unpacking long sequences of integers, but they would probably cause a lot of overhead to decode lots of short integer sequences such as the ones that can be generated by For and PFor. I've been looking at the For/PFor branch and it has a {{PackedIntsDecompress}} class (http://svn.apache.org/repos/asf/lucene/dev/branches/pforcodec_3892/lucene/core/src/java/org/apache/lucene/codecs/pfor/PackedIntsDecompress.java) which is very similar to {{oal.util.packed.BulkOperation}} (package-private), so maybe we should find a way to expose this class so that the For/PFor branch can directly use it. -- This message is automatically generated by JIRA. 
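The (un)packing the LUCENE-4239 discussion is about can be sketched generically. This is not the actual oal.util.packed.BulkOperation or PackedIntsDecompress code, just a self-contained illustration of packing fixed-bit-width values into long[] blocks and reading them back, the operation For/PFor perform on every postings block.

```java
import java.util.Arrays;

// Toy bit-packing sketch (little-endian within each 64-bit block); illustrative only.
public class PackDemo {
    // pack non-negative values of `bits` bits each into a long[]
    static long[] pack(int[] values, int bits) {
        long[] blocks = new long[(values.length * bits + 63) / 64];
        int bitPos = 0;
        for (int v : values) {
            int block = bitPos >>> 6, offset = bitPos & 63;
            blocks[block] |= ((long) v) << offset;          // low part (overflow bits drop off)
            if (offset + bits > 64) {
                blocks[block + 1] |= ((long) v) >>> (64 - offset); // spill high part
            }
            bitPos += bits;
        }
        return blocks;
    }

    static int[] unpack(long[] blocks, int count, int bits) {
        int[] out = new int[count];
        long mask = (1L << bits) - 1;
        int bitPos = 0;
        for (int i = 0; i < count; i++) {
            int block = bitPos >>> 6, offset = bitPos & 63;
            long v = blocks[block] >>> offset;
            if (offset + bits > 64) {
                v |= blocks[block + 1] << (64 - offset);    // rejoin the spilled high part
            }
            out[i] = (int) (v & mask);
            bitPos += bits;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] docDeltas = {3, 7, 1, 15, 2, 9, 4, 11};       // e.g. doc-id deltas
        long[] packed = pack(docDeltas, 5);                 // 5 bits per value
        int[] restored = unpack(packed, docDeltas.length, 5);
        System.out.println(Arrays.equals(docDeltas, restored)); // prints "true"
    }
}
```

The API friction described in the issue is visible even here: the natural working type for postings is int[], while the packed storage is long[], which is why the codecs want conversion methods rather than going through long[] round-trips.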
[jira] [Commented] (LUCENE-4241) non-reproducible failures from RecoveryZkTest - mostly NRTCachingDirectory.deleteFile
[ https://issues.apache.org/jira/browse/LUCENE-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419438#comment-13419438 ] Robert Muir commented on LUCENE-4241: - Don't you think the problem is likely that Solr's replication doesn't use the Directory API, instead working on Files directly? It's accessing NRTCachingDir's delegate FSDir, then modifying files in that directory all underneath NRTCachingDir.

non-reproducible failures from RecoveryZkTest - mostly NRTCachingDirectory.deleteFile - Key: LUCENE-4241 URL: https://issues.apache.org/jira/browse/LUCENE-4241 Project: Lucene - Java Issue Type: Bug Reporter: Hoss Man Attachments: RecoveryZkTest.testDistribSearch-100-tests-failures.txt.tgz, just-failures.txt

Since getting my new laptop, I've noticed some sporadic failures from RecoveryZkTest, so last night I tried running 100 iterations against trunk (r1363555), and got 5 errors/failures...
* 3 assertion failures from NRTCachingDirectory.deleteFile
* 1 node recovery assertion from AbstractDistributedZkTestCase.waitForRecoveriesToFinish caused by OOM
* 1 searcher leak assertion: opens=1658 closes=1652 (possibly lingering effects from OOM?)
see comments/attachments for details
[jira] [Updated] (LUCENE-4238) NRTCachingDirectory has concurrency bug(s).
[ https://issues.apache.org/jira/browse/LUCENE-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-4238: Attachment: LUCENE-4238.patch Here is an ugly rough test I started playing with yesterday - it will trigger the first exception quite often for me.

NRTCachingDirectory has concurrency bug(s). --- Key: LUCENE-4238 URL: https://issues.apache.org/jira/browse/LUCENE-4238 Project: Lucene - Java Issue Type: Bug Components: core/store Reporter: Mark Miller Fix For: 4.0, 5.0 Attachments: LUCENE-4238.patch
[jira] [Commented] (LUCENE-4238) NRTCachingDirectory has concurrency bug(s).
[ https://issues.apache.org/jira/browse/LUCENE-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419440#comment-13419440 ] Mark Miller commented on LUCENE-4238: - Looks like Hossman has also seen this same issue with RecoveryZkTest. Considering it doesn't seem to fail with that exception on all our jenkins machines, it may not be easy to see it there.

NRTCachingDirectory has concurrency bug(s). --- Key: LUCENE-4238 URL: https://issues.apache.org/jira/browse/LUCENE-4238 Project: Lucene - Java Issue Type: Bug Components: core/store Reporter: Mark Miller Fix For: 4.0, 5.0 Attachments: LUCENE-4238.patch
[jira] [Commented] (LUCENE-4238) NRTCachingDirectory has concurrency bug(s).
[ https://issues.apache.org/jira/browse/LUCENE-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419443#comment-13419443 ] Mark Miller commented on LUCENE-4238: - Somehow it seems not too difficult to get a file both cached and in the underlying dir - and delete and an assert really don't like that. NRTCachingDirectory has concurrency bug(s). --- Key: LUCENE-4238 URL: https://issues.apache.org/jira/browse/LUCENE-4238 Project: Lucene - Java Issue Type: Bug Components: core/store Reporter: Mark Miller Fix For: 4.0, 5.0 Attachments: LUCENE-4238.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
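The condition described above - a file visible in both the cache and the underlying dir at once - is easy to model. This is a toy sketch, not the real NRTCachingDirectory (the class and method names are invented); it only shows the invariant being asserted and why the create/uncache pair must be atomic with respect to each other:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy model of the invariant: a file name lives either in the RAM cache
// or in the delegate directory, never in both at the same time.
public class CacheInvariantDemo {
    private final Set<String> cache = ConcurrentHashMap.newKeySet();
    private final Set<String> delegate = ConcurrentHashMap.newKeySet();

    // Without a common lock, a create() racing an uncache() can briefly leave
    // the name visible in both sets -- the condition the deleteFile assert trips on.
    synchronized void create(String name) { delegate.remove(name); cache.add(name); }
    synchronized void uncache(String name) { if (cache.remove(name)) delegate.add(name); }
    synchronized boolean invariantHolds(String name) {
        return !(cache.contains(name) && delegate.contains(name));
    }

    public static void main(String[] args) throws Exception {
        CacheInvariantDemo dir = new CacheInvariantDemo();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 10_000; i++) {
            pool.execute(() -> dir.create("_0.frq"));
            pool.execute(() -> dir.uncache("_0.frq"));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(dir.invariantHolds("_0.frq"));
    }
}
```

Whether coarse synchronization is the right fix for the real class is exactly what the thread is unsure about (Mark notes that synchronizing every method did not help in his test); this sketch only makes the invariant itself concrete.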
[jira] [Updated] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Han Jiang updated LUCENE-3892: -- Attachment: LUCENE-3892-blockFor-with-packedints-decoder.patch

base: PackedInts.getReaderNoHeader().get(long[]), file io is handled by PackedInts.
comp: PackedInts.getDecoder().decode(LongBuffer,LongBuffer), use byte[] to hold the compressed block, and ByteBuffer.wrap().asLongBuffer as a wrapper.

Well, not as expected.

{noformat}
Task              QPS base  StdDev base  QPS comp  StdDev comp    Pct diff
AndHighHigh          23.78         1.06     23.38         0.42   -7% -  4%
AndHighMed           52.06         3.28     50.82         1.21  -10% -  6%
Fuzzy1               88.56         0.59     88.98         2.38   -2% -  3%
Fuzzy2               28.80         0.36     28.97         0.83   -3% -  4%
IntNRQ               41.92         1.67     41.34         0.50   -6% -  3%
OrHighHigh           15.85         0.45     15.89         0.39   -4% -  5%
OrHighMed            20.38         0.61     20.50         0.62   -5% -  6%
PKLookup            110.72         2.19    111.74         2.53   -3% -  5%
Phrase                7.51         0.12      7.05         0.18   -9% - -2%
Prefix3             106.27         2.65    105.37         1.13   -4% -  2%
Respell             112.03         0.81    112.79         2.71   -2% -  3%
SloppyPhrase         15.43         0.48     14.92         0.27   -7% -  1%
SpanNear              3.52         0.10      3.41         0.06   -7% -  1%
Term                 39.19         1.34     39.04         0.81   -5% -  5%
TermBGroup1M         18.45         0.68     18.33         0.56   -7% -  6%
TermBGroup1M1P       22.78         0.90     22.26         0.56   -8% -  4%
TermGroup1M          19.50         0.73     19.42         0.63   -7% -  6%
Wildcard             29.56         1.13     29.18         0.28   -5% -  3%
{noformat}

Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
- Key: LUCENE-3892 URL: https://issues.apache.org/jira/browse/LUCENE-3892 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Labels: gsoc2012, lucene-gsoc-12 Fix For: 4.1 Attachments: LUCENE-3892-BlockTermScorer.patch, LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints.patch, LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor.patch, LUCENE-3892-handle_open_files.patch, LUCENE-3892-pfor-compress-iterate-numbits.patch, LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for.patch, LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch On the flex branch we explored a number of possible intblock encodings, but for whatever reason never brought them to completion. There are still a number of issues opened with patches in different states. Initial results (based on prototype) were excellent (see http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html ). I think this would make a good GSoC project. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419444#comment-13419444 ] Han Jiang edited comment on LUCENE-3892 at 7/20/12 6:59 PM:

So I changed the patch to readBytes():
base: PackedInts.getReaderNoHeader().get(long[]), file io is handled by PackedInts.
comp: PackedInts.getDecoder().decode(LongBuffer,LongBuffer), use byte[] to hold the compressed block, and ByteBuffer.wrap().asLongBuffer as a wrapper.

Well, not as expected.

{noformat}
Task              QPS base  StdDev base  QPS comp  StdDev comp    Pct diff
AndHighHigh          23.78         1.06     23.38         0.42   -7% -  4%
AndHighMed           52.06         3.28     50.82         1.21  -10% -  6%
Fuzzy1               88.56         0.59     88.98         2.38   -2% -  3%
Fuzzy2               28.80         0.36     28.97         0.83   -3% -  4%
IntNRQ               41.92         1.67     41.34         0.50   -6% -  3%
OrHighHigh           15.85         0.45     15.89         0.39   -4% -  5%
OrHighMed            20.38         0.61     20.50         0.62   -5% -  6%
PKLookup            110.72         2.19    111.74         2.53   -3% -  5%
Phrase                7.51         0.12      7.05         0.18   -9% - -2%
Prefix3             106.27         2.65    105.37         1.13   -4% -  2%
Respell             112.03         0.81    112.79         2.71   -2% -  3%
SloppyPhrase         15.43         0.48     14.92         0.27   -7% -  1%
SpanNear              3.52         0.10      3.41         0.06   -7% -  1%
Term                 39.19         1.34     39.04         0.81   -5% -  5%
TermBGroup1M         18.45         0.68     18.33         0.56   -7% -  6%
TermBGroup1M1P       22.78         0.90     22.26         0.56   -8% -  4%
TermGroup1M          19.50         0.73     19.42         0.63   -7% -  6%
Wildcard             29.56         1.13     29.18         0.28   -5% -  3%
{noformat}

was (Author: billy):
base: PackedInts.getReaderNoHeader().get(long[]), file io is handled by PackedInts.
comp: PackedInts.getDecoder().decode(LongBuffer,LongBuffer), use byte[] to hold the compressed block, and ByteBuffer.wrap().asLongBuffer as a wrapper.

Well, not as expected.
{noformat}
Task              QPS base  StdDev base  QPS comp  StdDev comp    Pct diff
AndHighHigh          23.78         1.06     23.38         0.42   -7% -  4%
AndHighMed           52.06         3.28     50.82         1.21  -10% -  6%
Fuzzy1               88.56         0.59     88.98         2.38   -2% -  3%
Fuzzy2               28.80         0.36     28.97         0.83   -3% -  4%
IntNRQ               41.92         1.67     41.34         0.50   -6% -  3%
OrHighHigh           15.85         0.45     15.89         0.39   -4% -  5%
OrHighMed            20.38         0.61     20.50         0.62   -5% -  6%
PKLookup            110.72         2.19    111.74         2.53   -3% -  5%
Phrase                7.51         0.12      7.05         0.18   -9% - -2%
Prefix3             106.27         2.65    105.37         1.13   -4% -  2%
Respell             112.03         0.81    112.79         2.71   -2% -  3%
SloppyPhrase         15.43         0.48     14.92         0.27   -7% -  1%
SpanNear              3.52         0.10      3.41         0.06   -7% -  1%
Term                 39.19         1.34     39.04         0.81   -5% -  5%
TermBGroup1M         18.45         0.68     18.33         0.56   -7% -  6%
TermBGroup1M1P       22.78         0.90     22.26         0.56   -8% -  4%
TermGroup1M          19.50         0.73     19.42         0.63   -7% -  6%
Wildcard             29.56         1.13     29.18         0.28   -5% -  3%
{noformat}

Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.) - Key: LUCENE-3892 URL: https://issues.apache.org/jira/browse/LUCENE-3892 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Labels: gsoc2012, lucene-gsoc-12 Fix For: 4.1 Attachments: LUCENE-3892-BlockTermScorer.patch, LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints.patch, LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor-with-javadoc.patch,
[jira] [Commented] (LUCENE-4241) non-reproducible failures from RecoveryZkTest - mostly NRTCachingDirectory.deleteFile
[ https://issues.apache.org/jira/browse/LUCENE-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419447#comment-13419447 ] Mark Miller commented on LUCENE-4241: - I don't know what's causing it... But the assert that is generally tripped on delete is tripped because it's trying to uncache a file and finds it in the delegate - it's trying to assert the file is not in both - and I can cause that condition fairly easily in a simple multi-threaded NRTCachingDirectory test. I'm also not sure that removing files underneath this dir would cause that situation - it would seem not.

non-reproducible failures from RecoveryZkTest - mostly NRTCachingDirectory.deleteFile - Key: LUCENE-4241 URL: https://issues.apache.org/jira/browse/LUCENE-4241 Project: Lucene - Java Issue Type: Bug Reporter: Hoss Man Attachments: RecoveryZkTest.testDistribSearch-100-tests-failures.txt.tgz, just-failures.txt

Since getting my new laptop, I've noticed some sporadic failures from RecoveryZkTest, so last night I tried running 100 iterations against trunk (r1363555), and got 5 errors/failures...
* 3 assertion failures from NRTCachingDirectory.deleteFile
* 1 node recovery assertion from AbstractDistributedZkTestCase.waitForRecoveriesToFinish caused by OOM
* 1 searcher leak assertion: opens=1658 closes=1652 (possibly lingering effects from OOM?)
see comments/attachments for details
[jira] [Commented] (LUCENE-4241) non-reproducible failures from RecoveryZkTest - mostly NRTCachingDirectory.deleteFile
[ https://issues.apache.org/jira/browse/LUCENE-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419451#comment-13419451 ] Robert Muir commented on LUCENE-4241: - OK, I know I ran into hellacious problems trying to get all tests using MockDirectoryWrapper; I had to disable it in TestReplicationHandler for this reason, because it adds files outside of the Directory API, calling sync itself (but MockDirectoryWrapper doesn't know about this). This also makes it impossible to re-use MDW's facilities for testing disk full etc. (e.g. SOLR-3023).

non-reproducible failures from RecoveryZkTest - mostly NRTCachingDirectory.deleteFile - Key: LUCENE-4241 URL: https://issues.apache.org/jira/browse/LUCENE-4241 Project: Lucene - Java Issue Type: Bug Reporter: Hoss Man Attachments: RecoveryZkTest.testDistribSearch-100-tests-failures.txt.tgz, just-failures.txt

Since getting my new laptop, I've noticed some sporadic failures from RecoveryZkTest, so last night I tried running 100 iterations against trunk (r1363555), and got 5 errors/failures...
* 3 assertion failures from NRTCachingDirectory.deleteFile
* 1 node recovery assertion from AbstractDistributedZkTestCase.waitForRecoveriesToFinish caused by OOM
* 1 searcher leak assertion: opens=1658 closes=1652 (possibly lingering effects from OOM?)
see comments/attachments for details
[jira] [Commented] (SOLR-3292) /browse example fails to load on 3x: no field name specified in query and no default specified via 'df' param
[ https://issues.apache.org/jira/browse/SOLR-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419452#comment-13419452 ] Jan Høydahl commented on SOLR-3292: --- Why was this not caught by any tests? Should we add one? /browse example fails to load on 3x: no field name specified in query and no default specified via 'df' param --- Key: SOLR-3292 URL: https://issues.apache.org/jira/browse/SOLR-3292 Project: Solr Issue Type: Bug Reporter: Hoss Man Assignee: Hoss Man Priority: Blocker Fix For: 3.6, 4.0, 5.0 1) java -jar start.jar using solr example on 3x branch circa r1306629 2) load http://localhost:8983/solr/browse 3) browser error: 400 no field name specified in query and no default specified via 'df' param 4) error in logs... {noformat} INFO: [] webapp=/solr path=/browse params={} hits=0 status=400 QTime=3 Mar 28, 2012 4:05:59 PM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: no field name specified in query and no default specified via 'df' param at org.apache.solr.search.SolrQueryParser.checkNullField(SolrQueryParser.java:158) at org.apache.solr.search.SolrQueryParser.getFieldQuery(SolrQueryParser.java:174) at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1429) at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1317) at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1245) at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1234) at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:206) at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:79) at org.apache.solr.search.QParser.getQuery(QParser.java:143) at org.apache.solr.request.SimpleFacets.getFacetQueryCounts(SimpleFacets.java:233) at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:194) at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:72) at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:186) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
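For context on the error itself: the "no default specified via 'df' param" message goes away when the handler definition supplies a df default. A sketch only, assuming the example schema's text field and the stock /browse handler shape in solrconfig.xml:

```
<requestHandler name="/browse" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- default search field used when a query clause names no field -->
    <str name="df">text</str>
  </lst>
</requestHandler>
```

The bug here was that the shipped 3x example config lacked such a default for the fields its /browse templates query.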
[jira] [Commented] (SOLR-3654) Add some tests using Tomcat as servlet container
[ https://issues.apache.org/jira/browse/SOLR-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419457#comment-13419457 ] Steven Rowe commented on SOLR-3654: --- bq. I'm 100% against this. Why? Add some tests using Tomcat as servlet container Key: SOLR-3654 URL: https://issues.apache.org/jira/browse/SOLR-3654 Project: Solr Issue Type: Task Components: Build Environment: Tomcat Reporter: Jan Høydahl Labels: Tomcat Fix For: 4.0 All tests use Jetty, we should add some tests for at least one other servlet container (Tomcat). Ref discussion at http://search-lucene.com/m/6mo9Y1WZaWR1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4241) non-reproducible failures from RecoveryZkTest - mostly NRTCachingDirectory.deleteFile
[ https://issues.apache.org/jira/browse/LUCENE-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419459#comment-13419459 ] Mark Miller commented on LUCENE-4241: - Yeah, I'm not ruling it out either - I try to be very careful before filing Lucene bugs based on replication stuff and SolrCloud especially (with all its Jetty killing and whatnot). But I seemed to be able to cause the same situation with an isolated test (I'm not tripping the same assert, but causing a different exception in close because of the same invariant). So I'm somewhat sure it's a real issue with the dir, but since I don't have a fix, I don't know for sure. I tried just over-syncing by synchronizing every method in that dir, but no luck :)

non-reproducible failures from RecoveryZkTest - mostly NRTCachingDirectory.deleteFile - Key: LUCENE-4241 URL: https://issues.apache.org/jira/browse/LUCENE-4241 Project: Lucene - Java Issue Type: Bug Reporter: Hoss Man Attachments: RecoveryZkTest.testDistribSearch-100-tests-failures.txt.tgz, just-failures.txt

Since getting my new laptop, I've noticed some sporadic failures from RecoveryZkTest, so last night I tried running 100 iterations against trunk (r1363555), and got 5 errors/failures...
* 3 assertion failures from NRTCachingDirectory.deleteFile
* 1 node recovery assertion from AbstractDistributedZkTestCase.waitForRecoveriesToFinish caused by OOM
* 1 searcher leak assertion: opens=1658 closes=1652 (possibly lingering effects from OOM?)
see comments/attachments for details
[jira] [Commented] (SOLR-3640) Can't seem to click on any of the core admin buttons anymore
[ https://issues.apache.org/jira/browse/SOLR-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419464#comment-13419464 ] Antony Stubbs commented on SOLR-3640: - Ah - sorry guys. Should have tried it out in the others. It appears to render and perform actions correctly in Firefox and Safari. I.e. it's not even a WebKit-level issue - it only doesn't seem to work in Chrome.

Can't seem to click on any of the core admin buttons anymore Key: SOLR-3640 URL: https://issues.apache.org/jira/browse/SOLR-3640 Project: Solr Issue Type: Bug Components: web gui Affects Versions: 4.0-ALPHA Reporter: Antony Stubbs Priority: Critical Attachments: Screen Shot 2012-07-18 at 3.05.10 PM.png, screenshot-1.jpg

Trying to click on any of the buttons apparently has no effect. They also have no icons next to them anymore and appear down the left.
[jira] [Commented] (SOLR-3654) Add some tests using Tomcat as servlet container
[ https://issues.apache.org/jira/browse/SOLR-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419473#comment-13419473 ] Jan Høydahl commented on SOLR-3654: --- Mark, you have started committing changes making Solr Jetty-bound, without a prior discussion on dev. Many of our users depend on Solr working with their app-servers, especially the OEMs, so such a radical change of direction cannot be made without a thorough discussion and preferably a [VOTE]. Please continue to voice your view, and aid in constructive planning for how and when Solr could (if it should) become a standalone app rather than a WAR - also consulting the user community - but it is not constructive to block test quality progress in 4.0 as long as 4.0 is planned to be a WAR release as before.

Add some tests using Tomcat as servlet container Key: SOLR-3654 URL: https://issues.apache.org/jira/browse/SOLR-3654 Project: Solr Issue Type: Task Components: Build Environment: Tomcat Reporter: Jan Høydahl Labels: Tomcat Fix For: 4.0 All tests use Jetty, we should add some tests for at least one other servlet container (Tomcat). Ref discussion at http://search-lucene.com/m/6mo9Y1WZaWR1
[jira] [Commented] (SOLR-3623) analysis-extras lucene libraries are redundenly packaged (in war and in lucene-libs)
[ https://issues.apache.org/jira/browse/SOLR-3623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419475#comment-13419475 ] Hoss Man commented on SOLR-3623: Hmm .. ok, something wonky here I'm missing. I started by trying to do the following

{noformat}
svn mv solr/contrib/analysis-extras/src/java/org/apache/solr/analysis/MorfologikFilterFactory.java solr/core/src/java/org/apache/solr/analysis/
svn mv solr/contrib/analysis-extras/src/java/org/apache/solr/analysis/SmartChineseSentenceTokenizerFactory.java solr/core/src/java/org/apache/solr/analysis/
svn mv solr/contrib/analysis-extras/src/java/org/apache/solr/analysis/SmartChineseWordTokenFilterFactory.java solr/core/src/java/org/apache/solr/analysis/
svn mv solr/contrib/analysis-extras/src/java/org/apache/solr/analysis/StempelPolishStemFilterFactory.java solr/core/src/java/org/apache/solr/analysis/
svn mv solr/contrib/analysis-extras/src/test/org/apache/solr/analysis/TestMorfologikFilterFactory.java solr/core/src/test/org/apache/solr/analysis/
svn mv solr/contrib/analysis-extras/src/test/org/apache/solr/analysis/TestSmartChineseFactories.java solr/core/src/test/org/apache/solr/analysis/
cd solr/core
ant test -Dtests.class=\*.analysis.\*
{noformat}

...my understanding being that the morfologik jars and their lucene counterparts should already be in solr core, so these solr classes and tests should be able to move over w/o any other changes, right? But this is causing all sorts of compilation failures related to not finding packages/classes like morfologik.stemming.PolishStemmer, org.apache.lucene.analysis.cn.smart.\*, org.apache.lucene.analysis.stempel.\*, etc... So clearly I'm missing something here in how these dependent jars and classpaths are set up (I haven't looked at the build system closely since the ivy change), so I'll have to dig into this more later today.
(posting this now in the slim hope that sarowe or rmuir see it and say oh, yeah - the thing you are overlooking is...)

analysis-extras lucene libraries are redundantly packaged (in war and in lucene-libs)
Key: SOLR-3623
URL: https://issues.apache.org/jira/browse/SOLR-3623
Project: Solr
Issue Type: Bug
Components: Build
Reporter: Lance Norskog
Assignee: Hoss Man
Priority: Minor
Fix For: 4.0, 5.0

Various dependencies for contrib/analysis-extras are packaged in contrib/analysis-extras/lucene-libs (along with instructions in contrib/analysis-extras/README.txt that users need to include them explicitly) even though these jars are already hardcoded into the solr war file.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
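[Editor's note: the redundancy SOLR-3623 describes - the same jar names shipped both inside the war and under lucene-libs - can be illustrated with a small standalone sketch. The directory layout and jar names below are hypothetical stand-ins, not the actual Solr build output.]

```shell
#!/bin/bash
# Simulate two packaging locations containing an overlapping jar
# (stand-ins for WEB-INF/lib inside the war and contrib/.../lucene-libs).
mkdir -p /tmp/solr3623/war-libs /tmp/solr3623/lucene-libs
touch /tmp/solr3623/war-libs/lucene-stempel-4.0.jar
touch /tmp/solr3623/war-libs/lucene-smartcn-4.0.jar
touch /tmp/solr3623/lucene-libs/lucene-stempel-4.0.jar

# comm -12 prints only lines common to both sorted lists,
# i.e. the jars that are packaged redundantly in both places.
comm -12 \
  <(ls /tmp/solr3623/war-libs | sort) \
  <(ls /tmp/solr3623/lucene-libs | sort)
```

Against a real checkout one would list `WEB-INF/lib` out of the built war (e.g. via `unzip -l`) instead of a plain directory.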
[jira] [Commented] (SOLR-3640) Can't seem to click on any of the core admin buttons anymore
[ https://issues.apache.org/jira/browse/SOLR-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419477#comment-13419477 ] Antony Stubbs commented on SOLR-3640:

There were also other parts of the UI that, I realise now, I wasn't seeing correctly - the dashboard, and the java properties views.

Can't seem to click on any of the core admin buttons anymore
Key: SOLR-3640
URL: https://issues.apache.org/jira/browse/SOLR-3640
Project: Solr
Issue Type: Bug
Components: web gui
Affects Versions: 4.0-ALPHA
Reporter: Antony Stubbs
Priority: Critical
Attachments: Screen Shot 2012-07-18 at 3.05.10 PM.png, screenshot-1.jpg

Trying to click on any of the buttons apparently has no effect. They also have no icons next to them anymore and appear down the left.
Re: [jira] [Commented] (LUCENE-4241) non-reproducible failures from RecoveryZkTest - mostly NRTCachingDirectory.deleteFile
: Yeah, I'm not ruling it out either - I try to be very careful before
: filing lucene bugs based on replication stuff and solrcloud especially

FWIW: i didn't even mean to file this as a LUCENE bug. I clicked Create Bug on a SOLR page and Jira just outsmarted me because of its crazy cookie stuff and having multiple tabs open - i just didn't bother to Move it when i realized the connection with LUCENE-4238. I make no assumptions about where the underlying problem really is.

-Hoss