Hi Cory:

On Fri, 2007-05-04 at 13:52 -0400, Cory Snavely wrote:
> So you are saying that for a format of eg PDF, filter-media, during its
> traversal of the assetstore backended on eg SRB, reads the PDF from SRB,
> extracts text, and stores that as a file back in SRB. 

Yes. A little more precisely: MediaFilter does not directly traverse the
backend - rather it examines each Item in the database, then for each
bitstream in the ORIGINAL bundle of that item, if (1) the format of the
bitstream (as recorded in the database) has a filter associated with it
(as is the case with PDF), and (2) the extracted text file has not
already been created, then it reads the (e.g. PDF) file, using the
standard API (which hides the actual location of the file), extracts the
text, and stores - again using the standard API - the text as a file in
the TEXT bundle of the item.
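
As a rough illustration of the loop described above, here is a minimal
sketch in Python (DSpace itself is Java). It models items and bundles in
memory only. The names Bitstream, Item, FILTERS and filter_item, and the
"<name>.txt" naming of the extracted text, are assumptions made for the
example, not the actual DSpace classes or API.

```python
from dataclasses import dataclass, field


@dataclass
class Bitstream:
    name: str
    fmt: str     # format as recorded in the database
    data: bytes


@dataclass
class Item:
    # bundle name -> list of Bitstream (hypothetical in-memory model)
    bundles: dict = field(default_factory=dict)


# Hypothetical filter registry: format name -> text extractor.
# A real PDF filter would parse the document; this one just decodes bytes.
FILTERS = {"PDF": lambda data: data.decode("utf-8", errors="replace")}


def filter_item(item):
    """Create missing extracted-text bitstreams; return how many were made."""
    originals = item.bundles.get("ORIGINAL", [])
    text_bundle = item.bundles.setdefault("TEXT", [])
    existing = {b.name for b in text_bundle}
    created = 0
    for bs in originals:
        extractor = FILTERS.get(bs.fmt)
        if extractor is None:
            continue                      # (1) no filter for this format
        target = bs.name + ".txt"         # assumed naming convention
        if target in existing:
            continue                      # (2) text already extracted
        text = extractor(bs.data)
        text_bundle.append(Bitstream(target, "Text", text.encode("utf-8")))
        existing.add(target)
        created += 1
    return created
```

Running filter_item a second time on the same item creates nothing new,
which mirrors condition (2) above.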

> Then, once its
> crawl of the assetstore is done, it reads the extracted text back in
> from SRB and indexes it. The index then lives in the filesystem,
> specifically within [dspace]/search.

Yes. A little more precisely: as a convenience, by default the indexer
is invoked after MediaFilter has run (this can be defeated with a
command-line argument). But the extracted text is read back and indexed
whenever indexing is run (e.g. when 'index-all' is run), not only after
MediaFilter. The index files do live at [dspace]/search, which is
conventionally a local filesystem, but certainly may be an NFS
mount-point, etc.
> 
> When I refer to transactions against SRB, I am assuming that those are
> generic read and write operations in DSpace methods that are calling eg
> SRB methods.

Yes, the 'BitstreamStorageManager' exports methods to read, write, etc.
These constitute the API to which I was alluding.
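
To make the point concrete, here is a minimal Python sketch of a storage
layer that hides where the bytes live (DSpace itself is Java). AssetStore,
LocalStore, RemoteStore and round_trip are hypothetical names for this
example, and RemoteStore is only an in-memory stand-in for a backend such
as SRB; none of this is the actual BitstreamStorageManager interface.

```python
import io
import os
import tempfile
import uuid
from abc import ABC, abstractmethod


class AssetStore(ABC):
    """Hypothetical storage interface: callers see IDs and streams only."""

    @abstractmethod
    def store(self, stream) -> str:
        """Persist the stream's bytes and return a new bitstream ID."""

    @abstractmethod
    def retrieve(self, bitstream_id: str):
        """Return a readable binary stream for the given ID."""


class LocalStore(AssetStore):
    """Keeps each bitstream as a file under a local directory."""

    def __init__(self, root):
        self.root = root

    def store(self, stream):
        bid = uuid.uuid4().hex
        with open(os.path.join(self.root, bid), "wb") as out:
            out.write(stream.read())
        return bid

    def retrieve(self, bitstream_id):
        return open(os.path.join(self.root, bitstream_id), "rb")


class RemoteStore(AssetStore):
    """In-memory stand-in for a remote backend such as SRB."""

    def __init__(self):
        self._blobs = {}

    def store(self, stream):
        bid = uuid.uuid4().hex
        self._blobs[bid] = stream.read()
        return bid

    def retrieve(self, bitstream_id):
        return io.BytesIO(self._blobs[bitstream_id])


def round_trip(store, payload):
    """Caller code is identical whichever backend is plugged in."""
    bid = store.store(io.BytesIO(payload))
    with store.retrieve(bid) as stream:
        return stream.read()


if __name__ == "__main__":
    print(round_trip(RemoteStore(), b"extracted text"))
    with tempfile.TemporaryDirectory() as tmp:
        print(round_trip(LocalStore(tmp), b"extracted text"))
```

A filter or indexer written against round_trip-style calls never needs to
know which backend is configured.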

Hope this clarifies,

Richard
> 
> Correct? 
> 
> Thanks,
> Cory
> 
> On Fri, 2007-05-04 at 09:46 -0400, Richard Rodgers wrote:
> > See notes:
> > 
> > Quoting Cory Snavely <[EMAIL PROTECTED]>:
> > 
> > > Right--I am trying to get an understanding of all this in very specific
> > > terms.
> > >
> > > On Fri, 2007-05-04 at 09:23 -0400, Mark H. Wood wrote:
> > >> There are two questions here:
> > >>
> > >> 1)  Does the use of a non-filesystem asset store backend affect Lucene's
> > >>     output?  One would guess, no, since it doesn't do output to the
> > >>     asset store.
> > Correct - no. Lucene reads the file for indexing through the storage
> > API - it therefore has a BitStream, not a location on a storage device.
> >
> > >> 2)  Does the use of a non-filesystem asset store backend affect
> > >>     Lucene's input?  IOW how does Lucene, as used in DSpace, locate
> > >>     and gain access to the files it indexes?  If it doesn't go through
> > >>     the DSpace storage layer or something equivalent then indexing is
> > >>     screwed.
> > No - for the same reason. It does not circumvent the storage API or make
> > any assumptions about where the files with the text to index live.
> > >>
> > >> Ouch!  I hadn't thought about these at all.
> > >>
> > Remember, we already support SRB (a non-local filesystem option), and
> > indexing works fine.
> 
> 
> 
> -------------------------------------------------------------------------
> This SF.net email is sponsored by DB2 Express
> Download DB2 Express C - the FREE version of DB2 express and take
> control of your XML. No limits. Just data. Click to get it now.
> http://sourceforge.net/powerbar/db2/
> _______________________________________________
> DSpace-tech mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dspace-tech

