Hi Karl,

On Wed, Jan 11, 2012 at 4:21 AM, Karl Wright <daddy...@gmail.com> wrote:
> Hi Mark,
>
> I think I'd describe this simplified proposal as "pipeline" (vs.
> "Pipeline". Your original description was the latter.) This proposal
> is simpler but does not have the ability to amalgamate content from
> multiple connectors, correct?

Yes.

> As long as it is just modifying the
> content and metadata (as described by RepositoryDocument), it's not
> hard to develop a generic idea of a content processing pipeline, e.g.
> Tika.

Yay!

> There's a question in my mind as to where it belongs. If its purpose
> is to make up for missing code in particular search engines, then I'd
> argue it should be a service available to output connector coders, who
> can then choose how much configurability makes sense from the point of
> view of their target system.

I'm not sure if this question is revisiting the motivation for preferring this in MCF, or a technical question about how to package metadata for different engines that might want it in a different format.

For the former, I'd briefly rehash my answer from earlier in the thread: pipelines are not in every search engine, and many organizations deal with multiple search engines, so having a standard home for that logic would be awesome!

For the latter, how to pass metadata to engines, that's interesting. One almost universal way is to add metadata tags to the header portion of an HTML file. There are also some microformats that some engines understand. Could we just assume, for now, that additional metadata will be jammed into the HTML header, perhaps with an "x-" prefix on the name (a convention some folks like)?

> For instance, since Tika is already part
> of Solr, there would seem little benefit in adding a Tika pipeline
> upstream of Solr as well, but maybe a Google Appliance connector would
> want it and therefore expose it.

Including Tika would be useful for connectors that need to look at binary doc files to do their parsing.
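To make the "jam it into the HTML header" idea concrete, here is a minimal sketch of that convention. This is not MCF or Tika API; the class and method names are hypothetical, and it only illustrates emitting each metadata field as an "x-"-prefixed `<meta>` tag:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetaHeaderSketch {
    // Hypothetical helper: wrap body text and metadata into an HTML page,
    // emitting each metadata field as an "x-"-prefixed <meta> tag so a
    // downstream engine can pick the fields up from the header.
    static String wrapWithMetadata(String bodyText, Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder("<html><head>\n");
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            sb.append("<meta name=\"x-").append(e.getKey())
              .append("\" content=\"").append(escape(e.getValue())).append("\">\n");
        }
        sb.append("</head><body>\n").append(escape(bodyText)).append("\n</body></html>");
        return sb.toString();
    }

    // Minimal HTML escaping for attribute values and body text.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("author", "Mark Bennett");
        meta.put("source", "CMS");
        System.out.println(wrapWithMetadata("Hello, world", meta));
    }
}
```

Engines that ignore unknown `<meta>` names would simply skip these fields, which is part of the appeal of the convention.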
Even if the pipeline then discards Tika's output when it's done, it's still a worthwhile expense *if* it meets the project objective. As an example, the current MCF system looks for links in HTML. But hyperlinks can also appear in Word, Excel, and PDF files. Tika could, in theory, convert those docs so that they can also be scanned for links, and then later discard the converted file.

Another attractive pipeline would have Tika convert non-HTML binary files into some primitive form of HTML and then perhaps discard the original binary. So binary content would be transformed into HTML or TXT in the MCF stage, INSTEAD of having a search engine (or other system) do it. It's still a 1:1 transformation.

> If the pipeline's purpose is to
> include arbitrary business logic, on the other hand, then I think what
> you'd really need is a Pipeline and not a pipeline, if you see what I
> mean.

Given the dismal state of open tools, I'd be excited just to see 1:1 "pipeline" functionality made widely available. I'm regretting, to some extent, bringing in the more complex Pipeline logic, as it may have partially derailed the conversation. I'm one of the authors of the old XPump tool, which was able to do very fancy things but suffered from other issues. But better to have something now than nothing. And I'll ponder the more complex scenarios some more.

> So, my question to you is, what would the main use case(s) be for a
> "pipeline" in your view?

I've given a couple of examples above, of 1:1 transforms. I *KNOW* this is of interest to some folks, but it sounds like I've failed to convince you. I'd ask you to take it on faith, but you don't know me very well, so that'd be asking a lot.

As an academic exercise, suppose it were given that this was a good thing to do; what would then be the easiest way to do it?

A final question for you Karl, since we've both invested some time in discussing something that would normally be very complex to explain to others.
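As a sketch of the link-harvesting stage described above: assume an upstream step (e.g. Tika, whose actual API is not shown here) has already converted a Word/Excel/PDF document into primitive HTML. A downstream stage could then pull out the hyperlinks with plain JDK regex, after which the converted file could be discarded. The class name and the simple `href` regex are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkScanSketch {
    // Assumes an upstream stage (e.g. Tika) already produced primitive HTML
    // from the binary document; this step only harvests the hyperlinks.
    // A naive regex is used here purely for illustration.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String convertedHtml) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(convertedHtml);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p>See <a href=\"http://example.com/a\">A</a> and "
                    + "<a href=\"http://example.com/b\">B</a>.</p>";
        System.out.println(extractLinks(html));
    }
}
```

In a real pipeline a proper HTML parser would be preferable to a regex, but the point is only that link discovery becomes uniform once everything is HTML.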
What open source tools would YOU suggest I look at, for a new home for uber pipeline processing? I think you understand some of the logical functionality I want to model. Some other wish list items:

* Leverage MCF connectors
* A web UI framework for monitoring

I'd say up front that I've considered Nutch, but I don't think it's a good fit, for other reasons. I'm still looking at UIMA. I keep finding justifications for how awesome UIMA is, but less on the technical side; I'm not sure it models a data flow design that well. The other area I looked at was some of the Eclipse process graph stuff, "Business Process Management" I think. There's a TON of open source projects there.

> Karl
>
> On Wed, Jan 11, 2012 at 6:31 AM, Mark Bennett <mbenn...@ideaeng.com> wrote:
> > Hi Karl,
> >
> > Still pondering our last discussion. Wondering if I got things off track.
> >
> > As a start, what if I backtracked a bit, to this:
> >
> > What's the easiest way to do this:
> > * A connector that tweaks metadata from a single source.
> > * Sits between any existing MCF datasource connector and the main MCF engine
> >
> > Before:
> >
> > CMS/DB -> Existing MCF connector -> MCF core -> output
> >
> > After:
> >
> > CMS/DB -> Existing MCF connector -> Metadata tweaker -> MCF core -> output
> >
> > Assume the metadata changes don't have any impact on security, or that no
> > security is being used (public data)
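The quoted Before/After flow could be modeled as a 1:1 transform interface sitting between the connector and the core. This is only a sketch under stated assumptions: `Doc` is a hypothetical stand-in for MCF's RepositoryDocument, and `Tweaker` is an invented interface, not anything in the MCF codebase:

```java
import java.util.HashMap;
import java.util.Map;

public class TweakerSketch {
    // Hypothetical stand-in for MCF's RepositoryDocument: content plus metadata.
    static class Doc {
        String content;
        Map<String, String> metadata = new HashMap<>();
        Doc(String content) { this.content = content; }
    }

    // A 1:1 "pipeline" stage: takes one document, returns one document,
    // matching the "Metadata tweaker" box in the After diagram.
    interface Tweaker {
        Doc tweak(Doc doc);
    }

    public static void main(String[] args) {
        // Example tweaker: normalize a metadata field and stamp the stage name.
        Tweaker lowercaseAuthor = doc -> {
            doc.metadata.computeIfPresent("author", (k, v) -> v.toLowerCase());
            doc.metadata.put("processed-by", "metadata-tweaker");
            return doc;
        };

        Doc doc = new Doc("body text");
        doc.metadata.put("author", "Mark Bennett");
        doc = lowercaseAuthor.tweak(doc);  // sits between connector and MCF core
        System.out.println(doc.metadata);
    }
}
```

Because each stage is 1:1, stages compose trivially by chaining `tweak` calls, which is what keeps this "pipeline" (lowercase) rather than "Pipeline".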