[jira] [Commented] (CONNECTORS-286) Get ManifoldCF to run on top of a key/value store like Voldemort, for potential massive scalability improvements and speed gains

2012-01-03 Thread Karl Wright (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178695#comment-13178695
 ] 

Karl Wright commented on CONNECTORS-286:


Using the Warthog API as the standard ManifoldCF way of dealing with databases 
may not be practical, for the following reasons.
- A significant amount of the actual functionality of Warthog comes from Java 
methods you supply to it.  This is fundamentally incompatible with using a 
standard database to do the same thing, because there are bound to be 
situations where the two implementations disagree.
- A full database implementation under Warthog entails using the database for 
table storage and index access (ordered) with conditions applied to the index.  
Warthog would do the rest.  But it is conceivable that this would not perform 
as well as native database queries.
- It is not clear how to construct a cache key in Warthog, so caching database 
results will require some thought.  Caching at the interface to the underlying 
database is not practical at all, because only partial resultsets will be read 
from many of the queries.
- It's not even clear (yet) whether critical functionality is missing from 
Warthog that will be needed to implement ManifoldCF.

Nevertheless, the next step is to try to create an implementation of Warthog 
where WHTableStore, WHTable, and WHIndex are implemented by an underlying 
relational database.  The difficulty, as stated above, is that the index (for 
example) is defined in terms of a WHComparator for each column being indexed, 
and a WHComparator is opaque Java code.  Besides performing the comparison, 
that code must also agree with how the database orders the same values, AND be 
capable of assisting in the generation of SQL code.  
Special SQL-consistent WHComparator implementations are therefore going to be 
necessary, which also implement another interface (SQLInspectable?).  The 
WHIndex implementation can therefore use them to do what it needs, and complain 
if somebody tries to use incompatible comparator implementations.
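
A minimal sketch of the comparator idea above, under loud assumptions: the 
method signatures of WHComparator are not given in this comment, so the 
interface below is a hypothetical stand-in, and SQLInspectable, 
orderByFragment, SQLLongAscendingComparator and SQLBackedIndex are invented 
names for illustration only.  Null handling is left out.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for Warthog's comparator; the real signature may differ.
interface WHComparator extends Comparator<Object> {
}

// Hypothetical second interface: the comparator can describe itself in SQL.
interface SQLInspectable {
    // Returns the ORDER BY fragment that matches compare() for this column.
    String orderByFragment(String columnName);
}

// Compares non-null Long values ascending, which is also what "col ASC" does
// for non-null integer columns.
final class SQLLongAscendingComparator implements WHComparator, SQLInspectable {
    @Override
    public int compare(Object a, Object b) {
        return Long.compare((Long) a, (Long) b);
    }

    @Override
    public String orderByFragment(String columnName) {
        return columnName + " ASC";
    }
}

// Hypothetical database-backed index: it refuses comparators it cannot
// translate into SQL, and builds its ORDER BY clause from the ones it accepts.
final class SQLBackedIndex {
    private final String orderByClause;

    SQLBackedIndex(List<String> columns, List<WHComparator> comparators) {
        if (columns.isEmpty() || columns.size() != comparators.size()) {
            throw new IllegalArgumentException("Need exactly one comparator per indexed column");
        }
        StringBuilder sb = new StringBuilder("ORDER BY ");
        for (int i = 0; i < columns.size(); i++) {
            WHComparator comparator = comparators.get(i);
            if (!(comparator instanceof SQLInspectable)) {
                throw new IllegalArgumentException(
                    "Comparator for column '" + columns.get(i) + "' is not SQL-consistent");
            }
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(((SQLInspectable) comparator).orderByFragment(columns.get(i)));
        }
        orderByClause = sb.toString();
    }

    String orderByClause() {
        return orderByClause;
    }
}
```

The point of the instanceof check is the "complain" behavior: an opaque 
comparator is rejected when the index is constructed, rather than silently 
producing an ordering the database would not reproduce.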

Thus, each implementation of the Warthog API consists of:
- Implementations of WHTableStore and WHTable and WHIndex
- A body of comparators, filters, etc. that implement data types consistent 
with the SQL database
 


 Get ManifoldCF to run on top of a key/value store like Voldemort, for 
 potential massive scalability improvements and speed gains
 

 Key: CONNECTORS-286
 URL: https://issues.apache.org/jira/browse/CONNECTORS-286
 Project: ManifoldCF
  Issue Type: New Feature
  Components: Framework core
Reporter: Karl Wright
Assignee: Karl Wright
 Fix For: ManifoldCF next


 ManifoldCF's reliance on a relational database limits its throughput and 
 scalability.  I am now convinced it is possible to build all the structures 
 we need within a distributed key-value store like Voldemort, which has the 
 nice side effect of permitting massive scaling.  I envision there will be 
 several layers to this project, some of which may have broader utility in the 
 open-source community at large:
 (1) An atomic serialization layer, which adds serialization capabilities to 
 a non-transactional substrate;
 (2) A transaction layer, which uses atomic serialization to build a notion of 
 light transactions;
 (3) A table and index layer, which defines SQL-like concepts of tables and 
 btree indexes on top of the transaction layer, via a Java API;
 (4) A generic database abstraction layer, which is capable of representing 
 both standard SQL databases as well as this NoSQL variant, so that ManifoldCF 
 can support both models.
 This is obviously a major development task, and as such is not envisioned to 
 be completed by the next standard release.  Work will indeed need to be done 
 in a branch.
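
One way to picture how layers (1) and (2) might relate is the sketch below.  
It is not taken from the ticket and does not use Voldemort's API: 
VersionedStore, putIfVersion and update are hypothetical names, the store is 
an in-memory map standing in for the key/value substrate, and only a 
single-key "light transaction" is shown.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.UnaryOperator;

// Hypothetical in-memory stand-in for a non-transactional key/value store.
final class VersionedStore {
    // A stored value together with the version at which it was written.
    static final class Versioned {
        final long version;
        final String value;

        Versioned(long version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    private final ConcurrentMap<String, Versioned> data = new ConcurrentHashMap<>();

    Versioned get(String key) {
        return data.get(key);
    }

    // Atomic step: the write succeeds only if 'expected' is still the current
    // entry (or the key is still absent when 'expected' is null).  Versioned
    // does not override equals, so replace() compares by object identity.
    boolean putIfVersion(String key, Versioned expected, String newValue) {
        long nextVersion = expected == null ? 1 : expected.version + 1;
        Versioned next = new Versioned(nextVersion, newValue);
        if (expected == null) {
            return data.putIfAbsent(key, next) == null;
        }
        return data.replace(key, expected, next);
    }

    // Single-key "light transaction": read, compute, and retry the whole
    // read-compute-write cycle if another writer got in first.
    String update(String key, UnaryOperator<String> change) {
        while (true) {
            Versioned current = get(key);
            String updated = change.apply(current == null ? null : current.value);
            if (putIfVersion(key, current, updated)) {
                return updated;
            }
        }
    }
}
```

A writer holding a stale version simply loses and retries, which is the 
optimistic behavior a transaction layer could build on; multi-key 
transactions would need considerably more than this.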

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CONNECTORS-258) pom.xml refers to jars not available in public repositories

2012-01-03 Thread Karl Wright (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-258:
---

Fix Version/s: (was: ManifoldCF next)
   ManifoldCF 0.3
 Assignee: Karl Wright

 pom.xml refers to jars not available in public repositories
 ---

 Key: CONNECTORS-258
 URL: https://issues.apache.org/jira/browse/CONNECTORS-258
 Project: ManifoldCF
  Issue Type: Bug
  Components: Build
Affects Versions: ManifoldCF 0.4
 Environment: all supported platforms
Reporter: Alex Ott
Assignee: Karl Wright
Priority: Minor
  Labels: maven
 Fix For: ManifoldCF 0.3

 Attachments: mvn-bootstrap.sh


 The Maven pom.xml files refer to jars that aren't available in public 
 repositories such as Maven Central, the Apache repository, etc. This includes:
  - com.bitmechanic:jdbcpool
  - org.hsqldb:hsqldb:jar:2.2.5.6-9-2011 (at Maven Central only version 2.2.4 
 is available right now)
 I think that ManifoldCF should adopt the same approach as other Apache 
 projects, like Tika, where all needed jars are first promoted to public 
 repositories, and only after that are they used as dependencies...





[jira] [Resolved] (CONNECTORS-258) pom.xml refers to jars not available in public repositories

2012-01-03 Thread Karl Wright (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-258.


Resolution: Fixed

The mvn-bootstrap.sh/.bat scripts have been part of ManifoldCF since the 0.3 release.






[jira] [Updated] (CONNECTORS-318) Make it easier to trace XML parsing errors

2012-01-03 Thread Karl Wright (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-318:
---

Affects Version/s: ManifoldCF 0.5
Fix Version/s: ManifoldCF 0.5
 Assignee: Karl Wright

Although it's far better to have the Solr connector handle its own diagnostics, 
this patch may still be helpful upon occasion, so I recommend committing it.


 Make it easier to trace XML parsing errors
 --

 Key: CONNECTORS-318
 URL: https://issues.apache.org/jira/browse/CONNECTORS-318
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Framework core
Affects Versions: ManifoldCF 0.5
Reporter: Martin Goldhahn
Assignee: Karl Wright
Priority: Minor
 Fix For: ManifoldCF 0.5

 Attachments: XMLDoc.java.patch


 I had a hard time tracking an erroneous response from Solr. All I got was 
 something like this:
 {{[Fatal Error] :112:120: The element type "HR" must be terminated by the 
 matching end-tag "</HR>".}}
 There was no indication what the error was and what component issued the error.
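
As an illustration of the kind of context that would help here, the sketch 
below uses the standard JAXP ErrorHandler hook to attach a caller-supplied 
label to parse failures.  It is not the attached XMLDoc.java.patch and makes 
no claim about how ManifoldCF parses XML; TraceableXmlParser and the 
"source" label are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

final class TraceableXmlParser {
    // Parses 'in'; 'source' is a free-form label naming the component and
    // the input being parsed, so that failures can be traced back to it.
    static Document parse(InputStream in, final String source)
            throws IOException, ParserConfigurationException, SAXException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setErrorHandler(new ErrorHandler() {
            @Override
            public void warning(SAXParseException e) {
                // Warnings are ignored in this sketch.
            }

            @Override
            public void error(SAXParseException e) throws SAXException {
                throw wrap(e);
            }

            @Override
            public void fatalError(SAXParseException e) throws SAXException {
                throw wrap(e);
            }

            // Adds the label and position to the parser's own message.
            private SAXException wrap(SAXParseException e) {
                return new SAXException("XML parse error in " + source
                    + " at line " + e.getLineNumber()
                    + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage(), e);
            }
        });
        return builder.parse(in);
    }
}
```

With a handler installed, the failure surfaces as an exception whose message 
names the source, instead of an anonymous line on standard error.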





[jira] [Updated] (CONNECTORS-309) On Canonicalization Tab , Allow regex transforms to modify the URL's for a crawl

2012-01-03 Thread Karl Wright (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-309:
---

Fix Version/s: (was: ManifoldCF next)
   ManifoldCF 0.5
 Assignee: Karl Wright

As stated this looks straightforward and will probably fit in the 0.5 timeframe.

 On Canonicalization Tab , Allow regex transforms to modify the URL's for a 
 crawl
 

 Key: CONNECTORS-309
 URL: https://issues.apache.org/jira/browse/CONNECTORS-309
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Web connector
Affects Versions: ManifoldCF 0.4
Reporter: Michael J. Kelleher
Assignee: Karl Wright
Priority: Minor
 Fix For: ManifoldCF 0.5


 There was not a Component for a Job.  Canonicalization is part of the Job 
 definition.
 I would like the ability to use a regex to transform a URL (not necessarily 
 including the hostname and port).  Specifically what I would like to use this 
 for is to remove certain URL request parameters from the URL.
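
To make the request concrete, here is a minimal sketch of a regex-based 
transform that strips one request parameter from a URL.  It only illustrates 
the idea and is not the web connector's canonicalization code; 
UrlParameterStripper and removeParameter are hypothetical names, and 
parameters written without "=" are not handled.

```java
import java.util.regex.Pattern;

final class UrlParameterStripper {
    // Removes every "name=value" occurrence of the given parameter from the
    // query string of 'url', leaving the path and any #fragment untouched.
    static String removeParameter(String url, String name) {
        // Split off the fragment so the regexes never touch it.
        int hash = url.indexOf('#');
        String fragment = hash >= 0 ? url.substring(hash) : "";
        String base = hash >= 0 ? url.substring(0, hash) : url;

        int q = base.indexOf('?');
        if (q < 0) {
            return url; // no query string, nothing to remove
        }
        String path = base.substring(0, q);
        String query = base.substring(q); // starts with '?'

        // Drop the parameter (and the '&' after it, if any) whenever it
        // directly follows '?' or '&'.
        String stripped = query.replaceAll("(?<=[?&])" + Pattern.quote(name) + "=[^&]*&?", "");
        // Drop a dangling '?' or '&' left at the end of the query.
        stripped = stripped.replaceAll("[?&]$", "");
        return path + stripped + fragment;
    }

    public static void main(String[] args) {
        System.out.println(removeParameter("http://example.com/page?a=1&sid=2&b=3", "sid"));
    }
}
```

The lookbehind keeps a parameter such as "xsid" from being treated as "sid", 
and Pattern.quote keeps regex metacharacters in the parameter name literal.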





[jira] [Commented] (CONNECTORS-292) Problems compiling agents, pull-agent, connectors/filesystem, etc directly in Maven

2012-01-03 Thread Karl Wright (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179064#comment-13179064
 ] 

Karl Wright commented on CONNECTORS-292:


Given the open-endedness of this ticket, I'm going to triage it into 
ManifoldCF-next.


 Problems compiling agents, pull-agent, connectors/filesystem, etc directly in 
 Maven
 ---

 Key: CONNECTORS-292
 URL: https://issues.apache.org/jira/browse/CONNECTORS-292
 Project: ManifoldCF
  Issue Type: Bug
  Components: Build
Affects Versions: ManifoldCF 0.4
 Environment: java 6, maven 3.0.3, ManifoldCF trunk version from 
 http://svn.apache.org/repos/asf/incubator/lcf/trunk
Reporter: Luca Stancapiano
Priority: Minor
 Fix For: ManifoldCF next


 If I try to execute the command 'mvn install -Dmaven.test.skip' inside 
 'framework/agents' I get this error:
 [ERROR] Failed to execute goal on project mcf-agents: Could not resolve 
 dependencies for project org.apache.manifoldcf:mcf-agents:jar:0.4.0-SNAPSHOT: 
 Failure to find org.apache.manifoldcf:mcf-core:jar:tests:0.4.0-SNAPSHOT in 
  was cached in the local repository, resolution will not be 
 reattempted until the update interval of sose-private has elapsed or updates 
 are forced - [Help 1]
 In the pom.xml of the mcf-agents project there is a wrong dependency:
 <dependency>
   <groupId>${project.groupId}</groupId>
   <artifactId>mcf-core</artifactId>
   <version>${project.version}</version>
   <type>test-jar</type>
   <scope>test</scope>
 </dependency>





[jira] [Updated] (CONNECTORS-351) Alfresco Connector documentation must be updated

2012-01-03 Thread Karl Wright (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-351:
---

Affects Version/s: ManifoldCF 0.5
Fix Version/s: ManifoldCF 0.5

 Alfresco Connector documentation must be updated
 

 Key: CONNECTORS-351
 URL: https://issues.apache.org/jira/browse/CONNECTORS-351
 Project: ManifoldCF
  Issue Type: Bug
  Components: Documentation
Affects Versions: ManifoldCF 0.5
Reporter: Piergiorgio Lucidi
Assignee: Piergiorgio Lucidi
 Fix For: ManifoldCF 0.5

   Original Estimate: 2h
  Remaining Estimate: 2h

 The Alfresco connector documentation must be updated with the new tenant 
 domain parameter (text and screenshots).





[jira] [Updated] (CONNECTORS-345) Jetty Configuration Support

2012-01-03 Thread Karl Wright (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-345:
---

Fix Version/s: ManifoldCF 0.5
 Assignee: Karl Wright

 Jetty Configuration Support
 ---

 Key: CONNECTORS-345
 URL: https://issues.apache.org/jira/browse/CONNECTORS-345
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Framework core
Affects Versions: ManifoldCF 0.4
 Environment: Jetty Configuration
Reporter: Michael J. Kelleher
Assignee: Karl Wright
 Fix For: ManifoldCF 0.5


 Can the single process example be extended to support Jetty configuration?
 1) jetty.xml
 2) webdefault.xml
 3) OPTIONS=, along with their corresponding XML config files; most 
 importantly the JMX option.  Server, ajp, and setuid would be nice to have.





Announcing the availability of UI testing infrastructure

2012-01-03 Thread Karl Wright
Folks,

When ManifoldCF was granted by MetaCarta, comprehensive tests existed
for the main Crawler UI as well as the UI contributions of each
connector.  This testing was all done in Python, and was thus
unavailable within JUnit, even though the MetaCarta test code itself
had been granted, including the Python browser emulator (which I had
written).

My original plan had been to port the browser emulator to Java.  I
kept starting to do this but other tasks continually interfered.
Eventually in December I gave up on finding a large enough block of
time to do the port, and instead created infrastructure that invokes
Python directly from within the JUnit test framework.  So we now have
limited but sufficient capability for testing connector UIs.
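
For anyone curious what "invoking Python from within JUnit" can look like in
general, here is a rough sketch using only the JDK.  It is not the actual
ManifoldCF test infrastructure; ExternalScriptRunner and its Result class
are hypothetical names chosen for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class ExternalScriptRunner {
    // Exit code plus everything the process wrote to stdout and stderr.
    static final class Result {
        final int exitCode;
        final String output;

        Result(int exitCode, String output) {
            this.exitCode = exitCode;
            this.output = output;
        }
    }

    // Runs the command, waits for it to finish, and captures its output.
    static Result run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout
        Process process = builder.start();
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        return new Result(process.waitFor(), output.toString());
    }
}
```

A JUnit test method could call run() with "python" and a script path (the
script name would be whatever your test generates) and fail the test,
reporting the captured output, whenever exitCode is nonzero.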

In order to use the tester, all you have to do is the following:
- Install Python 2.x on the computer you intend to test with.
- Make sure that typing the command "python" brings up the Python shell.
- Execute "ant uitest".

Currently tests exist for the filesystem connector, the rss connector,
and the web connector (which I'm currently completing).  To write your
own test, have a look at the code in
tests/rss/src/test/java/org/apache/manifoldcf/rss_tests/NavigationDerbyUI.java.
 It should be pretty self-explanatory.  Ask questions if it isn't.

I think we should have UI tests for all connectors before we ship 0.5,
so if you own a connector please consider adding such a test.  Bear
in mind that the UI tester is NOT going to emulate IE or Firefox, but
is only capable of doing the basics.  Thus, there are plenty of things
you can do in Javascript in a browser that won't work in the tester.
If you are trying to do something in your UI that the tester does not
like, usually the best solution is to simply do it in a different way.
 If that can't be done, we can augment the tester as needed.  Let me
know if you run into this problem and I'd be happy to help.

The tester is also rigorous about properly formed HTML, which is good
since most browsers silently accept crappy HTML and then break things
in different ways.

Thanks!
Karl