Hi all,

I noticed some confusion lately about the different components and their
purpose in Droids.

I added the following to the documentation last night:
"Droids (plural) is not designed for one special use case; it is a
framework: take what you need, do what you want, impossible is nothing.

It is the Cocoon/Unix philosophy for automated task processing in Java.
As a reminder, a pipe in Unix starts with an invoking component (which
produces a stream) and then chains as many other components that act on
the stream as are needed. The modifications made by each component are
passed to the next component in the chain.

For example, the following command on a Unix box will launch a
subversion command to check the status of the local svn checkout
(svn st). The next command keeps only the files that are not under svn
control (grep '^?'). The next command modifies the stream to create a
command that adds these files to the repository (awk ...). The last
step invokes the command by sending it to the shell (sh).

svn st | grep '^?' | awk '{print "svn add "$2}' | sh

In Droids you pipe/process your tasks through small specialist
components that, combined, solve your task.

Droids offers you the following components so far:

      * Queue, a queue is the data structure where the different tasks
        wait for service.
      * Protocol, the protocol interface is a wrapper that hides the
        underlying implementation of the communication at protocol
        level.
      * Parser -> Apache Tika, the parser component is just a wrapper
        for Tika since it offers everything we need. No need to
        duplicate the effort. The Parser component parses different
        input types to SAX events.
      * Handler, a handler is a component that uses the original stream
        and/or the parse (ContentHandler coming from Tika) and the url
        to invoke arbitrary business logic on the objects. Unlike the
        other components, several different handlers can be applied to
        the stream/parse.

A Droid (singular), however, is all about ONE special use case. For
example, the helloCrawler is a wget-style crawler, meaning you go to a
page, extract the links and afterwards save the page to the file system.
The helloCrawler focuses on this special use case and uses different
components to solve it.

In the future, different subprojects could evolve that provide
specialist components for a special use case. However, if components
get used in different use cases they should be considered common."
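
To make the pipe idea from the documentation a bit more concrete, here
is a minimal sketch in plain Java. It is not code from Droids:
PipeSketch and addCommands are names I made up for this mail, and the
final "sh" stage is left out on purpose, so the commands are only built,
never executed. Each stream stage plays the role of one small component
in the svn pipeline quoted above.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: class and method names are illustrative only
// and are not part of Droids.
class PipeSketch {

    // Mirrors the first three stages of the svn pipeline: take the
    // status lines, keep the unversioned entries, build "svn add" commands.
    static List<String> addCommands(List<String> statusLines) {
        return statusLines.stream()
                // filter stage: keep lines flagged with '?'
                .filter(line -> line.startsWith("?"))
                // transform stage: drop the flag, keep the rest of the line as the path
                .map(line -> line.substring(1).trim())
                // transform stage: build the command text
                .map(path -> "svn add " + path)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> status = List.of(
                "M       src/Main.java",
                "?       notes.txt",
                "?       docs/todo.txt");
        addCommands(status).forEach(System.out::println);
    }
}
```

Every stage only sees what the previous stage handed over, which is the
property the components in Droids are meant to share.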

In the light of LABS-149 and the move to reuse Tika 100% in the parser
phase (LABS-118) we will need a new component: a LinkExtractor or, with
a more generic name, a TaskExtractor.

This extractor component should act on the SAX events the parser
produces and return the Outlinks that meet the filter conditions.
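
A rough sketch of what such an extractor could look like, assuming the
SAX events describe XHTML-like markup with <a href="..."> elements. The
component does not exist yet, so everything here is hypothetical: the
class name LinkExtractorSketch, the use of a Predicate as the filter,
and plain Strings instead of an Outlink type. To keep the example
self-contained, the JDK SAX parser stands in for Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: LinkExtractorSketch is not an existing Droids class.
class LinkExtractorSketch extends DefaultHandler {
    private final Predicate<String> filter;
    private final List<String> outlinks = new ArrayList<>();

    LinkExtractorSketch(Predicate<String> filter) {
        this.filter = filter;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default JDK parser is not namespace aware, so the element
        // name arrives in qName.
        if ("a".equalsIgnoreCase(qName)) {
            String href = atts.getValue("href");
            if (href != null && filter.test(href)) {
                outlinks.add(href); // keep only links that meet the filter
            }
        }
    }

    List<String> getOutlinks() {
        return outlinks;
    }

    // Convenience for the example: parse a well-formed document and
    // return the links accepted by the filter.
    static List<String> extract(String xhtml, Predicate<String> filter) {
        LinkExtractorSketch extractor = new LinkExtractorSketch(filter);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), extractor);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return extractor.getOutlinks();
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<a href=\"http://example.org/a\">A</a>"
                + "<a href=\"mailto:someone@example.org\">Mail</a>"
                + "</body></html>";
        System.out.println(extract(page, href -> href.startsWith("http://")));
    }
}
```

Because the extractor is just a SAX handler, it would only need to
listen to the events and would not care which parser produced them.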

The helloCrawler has the following flow (as soon as the Tika integration
is finished and the extractor component is added):
queue -> protocol (opens stream) -> parser (receives stream and
transforms it to SAX) -> extractor (since we are crawling we want to
extract links from the stream) -> handler (use the original stream and
save it to disk)
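
Written out as a loop, the flow above could look roughly like the
following. This is a simplified sketch and not the helloCrawler code:
an in-memory map stands in for the network, a regular expression stands
in for the parser and extractor, and a second map stands in for the file
system. All names (CrawlFlowSketch, fetch, extractLinks, crawl) are
invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: none of these names come from the Droids code base.
class CrawlFlowSketch {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    // "protocol": opens the stream; a map lookup stands in for HTTP.
    static String fetch(Map<String, String> web, String url) {
        return web.get(url);
    }

    // "parser" + "extractor": a regex stands in for SAX events and link extraction.
    static List<String> extractLinks(String content) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(content);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Runs the whole flow from one seed and returns what the "handler" saved.
    static Map<String, String> crawl(Map<String, String> web, String seed) {
        Queue<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        Map<String, String> saved = new LinkedHashMap<>();
        queue.add(seed);
        seen.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String content = fetch(web, url);
            if (content == null) {
                continue; // page not reachable: nothing to parse or save
            }
            for (String link : extractLinks(content)) {
                if (seen.add(link)) {
                    queue.add(link); // new task discovered while processing
                }
            }
            saved.put(url, content); // "handler": a map stands in for the disk
        }
        return saved;
    }

    public static void main(String[] args) {
        Map<String, String> web = Map.of(
                "/index", "<a href=\"/about\">about</a>",
                "/about", "<a href=\"/index\">home</a>");
        System.out.println(crawl(web, "/index").keySet());
    }
}
```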

The helloCrawler is a crawler, meaning we start the queue with a single
page and, while processing it, extract new tasks that change the queue.

There is another typical use case for droids as well. I will call them
"racers" (a partial anagram of crawler). Racers do not try to extract
new tasks; they start with a limited number of tasks that are defined in
the initQueue method.

I will try to add an example of a file racer tonight since I have a nice
use case (I need to clean up the names of various files in a directory:
removing special characters and bringing them into a special form).
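
Until the real example is in, here is a rough idea of the shape such a
racer could take. Everything in it is an assumption: the class name
FileRacerSketch, the signature I gave initQueue, and above all the
target form of the names (lower case, runs of special characters
replaced by a single underscore), which is only a stand-in for the real
rules. The sketch only computes the new names and renames nothing.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: not the file racer that will be added to Droids.
class FileRacerSketch {

    // A racer gets all its tasks up front; nothing is added later.
    static Queue<String> initQueue(List<String> fileNames) {
        return new ArrayDeque<>(fileNames);
    }

    // Assumed target form: lower case, every run of characters other
    // than a-z and 0-9 becomes one '_', no leading or trailing '_'.
    static String cleanName(String name) {
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        String ext = dot > 0 ? name.substring(dot).toLowerCase(Locale.ROOT) : "";
        String cleaned = base.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", "_")
                .replaceAll("^_+|_+$", "");
        return cleaned + ext;
    }

    // Works through the fixed queue and returns old name -> new name.
    static Map<String, String> race(List<String> fileNames) {
        Queue<String> queue = initQueue(fileNames);
        Map<String, String> renames = new LinkedHashMap<>();
        while (!queue.isEmpty()) {
            String name = queue.poll();
            renames.put(name, cleanName(name)); // handler step, no extractor step
        }
        return renames;
    }

    public static void main(String[] args) {
        race(List.of("My Report (final)!.TXT", "notes.txt"))
                .forEach((oldName, newName) -> System.out.println(oldName + " -> " + newName));
    }
}
```

The difference to the crawler is only that nothing in the loop ever puts
a new task on the queue.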

Hope that clears up the different components a bit.

salu2
-- 
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java                      consulting, training and solutions

