Re: Revisiting: Should Manifold include Pipelines

2012-01-12 Thread Karl Wright
Hi Mark,





 I'm not sure if this question is revisiting the motivation for preferring
 this in MCF, or a technical question about how to package metadata for
 different engines that might want it in a different format.


I'm looking not so much for justification as for enough context to
decide how to structure the code.  Based on what I've heard, it probably
makes the most sense to provide a service available for both
repository connectors and output connectors to use in massaging
content.  The configuration needed for the service would therefore be
managed by the repository connector or output connector which required
the pipeline's services.
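
A minimal sketch of the service idea described above, assuming a simple 1:1
transform model.  The names PipelineDocument, PipelineStage and
PipelineService are hypothetical and are not part of ManifoldCF; the point is
only that a repository connector or an output connector could build and
configure such a chain itself and call it when it needs content massaged.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical document holder; not a ManifoldCF class.
final class PipelineDocument {
    final String content;
    final Map<String, String> metadata;

    PipelineDocument(String content, Map<String, String> metadata) {
        this.content = content;
        // Defensive copy so stages cannot mutate each other's input.
        this.metadata = new LinkedHashMap<>(metadata);
    }
}

// Hypothetical 1:1 transform: one document in, one document out.
interface PipelineStage {
    PipelineDocument apply(PipelineDocument in);
}

// Hypothetical service that either kind of connector could own and configure.
final class PipelineService {
    private final List<PipelineStage> stages = new ArrayList<>();

    PipelineService add(PipelineStage stage) {
        stages.add(stage);
        return this;
    }

    // Runs the stages in the order they were added.
    PipelineDocument run(PipelineDocument doc) {
        PipelineDocument current = doc;
        for (PipelineStage stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }
}
```

A connector would assemble the chain from its own configuration, for example
new PipelineService().add(stageA).add(stageB), and call run() on each
document it handles.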


 For the latter, how to pass metadata to engines, that's interesting.  One
 almost universal way is to add metadata tags to the header portion of an HTML
 file.  There are some other microformats that some engines understand.
 Could we just assume, for now, that additional metadata will be jammed
 into the HTML header, perhaps with an x- prefix for the name (a convention some
 folks like)?


I would presume that a Java coder who writes the output connector that
knows how to connect to the given search engine would tackle this
problem in the appropriate way.  I don't think it's a pipeline
question.
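
To illustrate the meta-tag convention Mark describes, here is a rough sketch
of what an output connector author might write.  The class HtmlMetaInjector
and its methods are hypothetical, the x- prefix is just the convention
mentioned above, and a real connector would follow whatever format its search
engine expects.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of ManifoldCF or any search engine API.
final class HtmlMetaInjector {
    private static final Pattern HEAD_CLOSE =
        Pattern.compile("</head\\s*>", Pattern.CASE_INSENSITIVE);

    // Escapes characters that would break an HTML attribute value.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("\"", "&quot;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Inserts one <meta name="x-..."> tag per entry just before </head>.
    // Documents without a closing head tag are returned unchanged.
    static String inject(String html, Map<String, String> metadata) {
        Matcher m = HEAD_CLOSE.matcher(html);
        if (!m.find()) {
            return html;
        }
        StringBuilder tags = new StringBuilder();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            tags.append("<meta name=\"x-").append(escape(e.getKey()))
                .append("\" content=\"").append(escape(e.getValue()))
                .append("\">\n");
        }
        return html.substring(0, m.start()) + tags + html.substring(m.start());
    }
}
```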


 Including Tika would be useful for connectors that need to look at binary
 doc files to do their parsing.  Even if the pipeline then discards Tika's
 output when it's done, it's still likely worth the expense *if* it meets the
 project objective.

 As an example, the current MCF system looks for links in HTML.  But
 hyperlinks can also appear in Word, Excel and PDF files.  Tika could, in
 theory, convert those docs so that they can also be scanned for links, and
 then later discard that converted file.


Sure, that's why I'd make the pipeline be available to every
connector.  The Java code for the connector would be modified, if
appropriate, to use the pipeline if it was helpful for it.
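
The convert, scan, discard idea could be sketched roughly as follows.
TextExtractor is a hypothetical stand-in for a Tika-backed converter (no real
Tika API is called here), and LinkScanner is likewise an illustrative name.
The converted text lives only for the duration of the call, matching the
"discard that converted file" step described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a converter such as Tika; not a real Tika interface.
interface TextExtractor {
    String extract(byte[] rawDocument);
}

// Hypothetical helper a connector could call on non-HTML documents.
final class LinkScanner {
    // Deliberately simple URL pattern; a real crawler would need something stricter.
    private static final Pattern URL =
        Pattern.compile("https?://[^\\s<>\"')]+");

    static List<String> findLinks(byte[] raw, TextExtractor extractor) {
        // The converted text is a local temporary and is discarded on return.
        String text = extractor.extract(raw);
        List<String> links = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }
}
```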


 Given the dismal state of open tools, I'd be excited to just see 1:1
 pipeline functionality be made widely available.

 I'm regretting, to some extent, bringing in the more complex Pipeline logic
 as it may have partially derailed the conversation.  I'm one of the authors
 of the old XPump tool, which was able to do very fancy things, but suffered
 from other issues.

 But better to have something now than nothing.  And I'll ponder the more
 complex scenarios some more.


I'll talk about this more further down.



 So, my question to you is, what would the main use case(s) be for a
 pipeline in your view?


 I've given a couple examples above, of 1:1 transforms.  I *KNOW* this is of
 interest to some folks, but it sounds like I've failed to convince you.
 I'd ask you to take it on faith, but you don't know me very well, so that'd
 be asking a lot.


The goal of the question was to confirm that you thought the value of
having a pipeline was high enough, vs. building a Pipeline, as
we've defined it.  I wanted to be sure there was no communication
issue and that we understood one another before anybody went off and
started writing code.


 A final question for you Karl, since we've both invested some time in
 discussing something that would normally be very complex to others.  What
 open source tools would YOU suggest I look at, for a new home for uber
 pipeline processing?  I think you understand some of the logical
 functionality I want to model.

 Some other wish list items:
 * Leverage MCF connectors
 * A web UI framework for monitoring

 I'd say up front that I've considered Nutch, but I don't think it's a good
 fit for other reasons.

 I'm still looking around at UIMA.  I keep finding the justification for
 UIMA, how awesome it is, but less on the technical side.  I'm not sure it
 models a data flow design that well.

 The other area I looked at was some of the Eclipse process graph stuff,
 Business Process Management I think.


 There's a TON of open source projects.


I can't claim to know all the open-source projects out
there.  But I'm unaware of one that really focuses on Pipeline
building from the perspective of crawling.

On the other hand, it seems pretty clear to me how one would go about
converting ManifoldCF to a Pipeline project.  What you'd get would
be a tool with UI components where you'd either glue the components
together with code, or use an amalgamation UI to generate the
necessary data flow.  There may already be tools in this space I don't
know of, but before you'd get to that point you'd want to have all the
technical underpinnings worked out.

The Pipeline services you'd want to provide would include functions
that each connector currently performs, but broken out as I'd
described in one of my earlier posts.  The document queue, which is
managed by the ManifoldCF framework right now, would need to be
redesigned since the entire notion of what a job is would require
redesign in a Pipeline world.

In order to develop such a thing, I'd be tempted to say fork 

[jira] [Created] (CONNECTORS-379) Merido connector needs to be internationalized

2012-01-12 Thread Hitoshi Ozawa (Created) (JIRA)
Merido connector needs to be internationalized
--

 Key: CONNECTORS-379
 URL: https://issues.apache.org/jira/browse/CONNECTORS-379
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Meridio connector
Affects Versions: ManifoldCF 0.5
Reporter: Hitoshi Ozawa
Priority: Minor
 Fix For: ManifoldCF 0.5


Messages in Merido connector needs to be externalized to properties file.
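
For readers unfamiliar with the technique, externalizing messages in Java
generally means looking strings up by key in a resource bundle instead of
hard-coding them.  The sketch below is only an illustration: the class name
MessagesSketch and the key used in the usage note are hypothetical, not taken
from the Meridio connector or the attached patch, and the bundle is read from
an in-memory string so the example is self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

// Hypothetical helper; not code from the ManifoldCF source tree.
final class MessagesSketch {
    // A real connector would normally load a .properties file from the
    // classpath with ResourceBundle.getBundle(baseName, locale).
    static ResourceBundle load(String propertiesText) {
        try {
            return new PropertyResourceBundle(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Looks up the message by key and fills in {0}, {1}, ... placeholders.
    static String message(ResourceBundle bundle, String key, Object... args) {
        return MessageFormat.format(bundle.getString(key), args);
    }
}
```

With a bundle containing the line "Hypothetical.ServerName=Server name: {0}",
calling message(bundle, "Hypothetical.ServerName", "host1") yields the
formatted string, and a translated properties file can supply the same key in
another language.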

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CONNECTORS-379) Merido connector needs to be internationalized

2012-01-12 Thread Hitoshi Ozawa (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitoshi Ozawa updated CONNECTORS-379:
-

Attachment: CONNECTORS-379.patch

 Merido connector needs to be internationalized
 --

 Key: CONNECTORS-379
 URL: https://issues.apache.org/jira/browse/CONNECTORS-379
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Meridio connector
Affects Versions: ManifoldCF 0.5
Reporter: Hitoshi Ozawa
Priority: Minor
  Labels: I18N
 Fix For: ManifoldCF 0.5

 Attachments: CONNECTORS-379.patch


 Messages in Merido connector needs to be externalized to properties file.





[jira] [Resolved] (CONNECTORS-379) Merido connector needs to be internationalized

2012-01-12 Thread Karl Wright (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-379.


Resolution: Fixed

r1230518


 Merido connector needs to be internationalized
 --

 Key: CONNECTORS-379
 URL: https://issues.apache.org/jira/browse/CONNECTORS-379
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Meridio connector
Affects Versions: ManifoldCF 0.5
Reporter: Hitoshi Ozawa
Assignee: Karl Wright
Priority: Minor
  Labels: I18N
 Fix For: ManifoldCF 0.5

 Attachments: CONNECTORS-379.patch


 Messages in Merido connector needs to be externalized to properties file.





[jira] [Resolved] (CONNECTORS-376) Meridio connector's Japanese messages are not fully translated

2012-01-12 Thread Karl Wright (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-376.


Resolution: Fixed

 Meridio connector's Japanese messages are not fully translated
 --

 Key: CONNECTORS-376
 URL: https://issues.apache.org/jira/browse/CONNECTORS-376
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Meridio connector
Affects Versions: ManifoldCF 0.5
Reporter: Hitoshi Ozawa
Assignee: Karl Wright
Priority: Minor
  Labels: I18N
 Fix For: ManifoldCF 0.5

 Attachments: CONNECTORS-376.patch


 Should translate Meridio connector's Japanese message properties





[jira] [Assigned] (CONNECTORS-379) Merido connector needs to be internationalized

2012-01-12 Thread Karl Wright (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-379:
--

Assignee: Karl Wright

 Merido connector needs to be internationalized
 --

 Key: CONNECTORS-379
 URL: https://issues.apache.org/jira/browse/CONNECTORS-379
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Meridio connector
Affects Versions: ManifoldCF 0.5
Reporter: Hitoshi Ozawa
Assignee: Karl Wright
Priority: Minor
  Labels: I18N
 Fix For: ManifoldCF 0.5

 Attachments: CONNECTORS-379.patch


 Messages in Merido connector needs to be externalized to properties file.





[jira] [Commented] (CONNECTORS-376) Meridio connector's Japanese messages are not fully translated

2012-01-12 Thread Karl Wright (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184952#comment-13184952
 ] 

Karl Wright commented on CONNECTORS-376:


r1230518 is the commit which internationalizes the Meridio connector.


 Meridio connector's Japanese messages are not fully translated
 --

 Key: CONNECTORS-376
 URL: https://issues.apache.org/jira/browse/CONNECTORS-376
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Meridio connector
Affects Versions: ManifoldCF 0.5
Reporter: Hitoshi Ozawa
Assignee: Karl Wright
Priority: Minor
  Labels: I18N
 Fix For: ManifoldCF 0.5

 Attachments: CONNECTORS-376.patch


 Should translate Meridio connector's Japanese message properties





[jira] [Created] (CONNECTORS-380) Need a UI test for the Alfresco connector

2012-01-12 Thread Karl Wright (Created) (JIRA)
Need a UI test for the Alfresco connector
-

 Key: CONNECTORS-380
 URL: https://issues.apache.org/jira/browse/CONNECTORS-380
 Project: ManifoldCF
  Issue Type: Test
  Components: Alfresco connector
Affects Versions: ManifoldCF 0.5
Reporter: Karl Wright
Assignee: Piergiorgio Lucidi
 Fix For: ManifoldCF 0.5


The Alfresco connector needs a UI test, and needs whatever modifications are 
needed to its UI to make it testable.






[jira] [Commented] (CONNECTORS-339) We need a test for all of the localized versions of the UI

2012-01-12 Thread Karl Wright (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184958#comment-13184958
 ] 

Karl Wright commented on CONNECTORS-339:


I've created new tickets for OpenSearchServer and Alfresco, so I can resolve 
this ticket.


 We need a test for all of the localized versions of the UI
 --

 Key: CONNECTORS-339
 URL: https://issues.apache.org/jira/browse/CONNECTORS-339
 Project: ManifoldCF
  Issue Type: Test
  Components: Tests
Affects Versions: ManifoldCF 0.5
Reporter: Karl Wright
Assignee: Karl Wright
 Fix For: ManifoldCF 0.5


 We need a way of testing the UI for functionality, regressions, and properly 
 formed HTML.





[jira] [Resolved] (CONNECTORS-339) We need a test for all of the localized versions of the UI

2012-01-12 Thread Karl Wright (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-339.


Resolution: Fixed

 We need a test for all of the localized versions of the UI
 --

 Key: CONNECTORS-339
 URL: https://issues.apache.org/jira/browse/CONNECTORS-339
 Project: ManifoldCF
  Issue Type: Test
  Components: Tests
Affects Versions: ManifoldCF 0.5
Reporter: Karl Wright
Assignee: Karl Wright
 Fix For: ManifoldCF 0.5


 We need a way of testing the UI for functionality, regressions, and properly 
 formed HTML.
