Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] Wed, 21 Jan 2009 12:50:17 -0800

Hi Tim,
     Thanks to all for the suggestions.  Basically I am trying to
prevent filter-media from attempting to filter our .pdf files and I want
index-all to index only our .txt files.


     So if I remove the pdffilter parameters from dspace.cfg and I have
all our .txt files in the TEXT bundle (using one of the 3 options you
outlined), this should work and we shouldn't have to run filter-media at
all, right?

Thanks again,
Sue

-----Original Message-----
From: Tim Donohue [mailto:tdono...@illinois.edu] 
Sent: Wednesday, January 21, 2009 10:54 AM
To: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
Cc: Diggory Mark; dspace-tech@lists.sourceforge.net; Kimbrough, Glenn W.
(LARC-B7)[NCI INFORMATION SYSTEMS]; Warren, Douglas Lewis (LARC-B7)[NCI
INFORMATION SYSTEMS]; Smail, James W. (LARC-B702)[NCI INFORMATION
SYSTEMS]
Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions

Sue,

Sorry, we've all been talking across each other a bit.  As you can
probably tell, there's really no "correct" answer on how to do this;
rather, there's a variety of options to choose from.

Essentially, you have 3 options that have been laid out by Mark, Claudia,
and myself.  I'm not certain which will be *easiest* off the top of my
head:

[Option 1]  Add the *.txt files to the "ORIGINAL" bundle (which is where
they are added by default).  If they are in the "ORIGINAL" bundle you
will have to run 'filter-media' to "filter" them into the "TEXT" bundle.
Then, you will run 'index-all' to index them for searching (as noted,
'index-all' only indexes documents in the "TEXT" bundle).  You will also
need to modify the UI if you don't want these *.txt files to be visible
to normal users.

[Option 2]  Add the *.txt files to the "TEXT" bundle directly.  There is
no way to do this via the normal DSpace user interfaces.  You can, however,
do this during the normal command-line bulk item import process by
specifying a "bundle" name in the 'contents' file.  See the DSpace Docs
for more information on this:
http://dspace.svn.sourceforge.net/viewvc/dspace/trunk/dspace/docs/application.html#itemimporter

[Option 3]  Claudia's suggestion is very similar to Option #1.  However,
as she notes, an easy way to "hide" the *.txt files from the UI is to go
into the DSpace Administration UI (specifically the "Bitstream Format
Registry") and mark the *.txt format as "internal".  This tells DSpace
that ALL *.txt files should be considered internal files, and should
NEVER be displayed in the UI.  So, you'd only want to do this if you
never want any *.txt files to be displayed from the UI.
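
For Option 2, each line of the 'contents' file can name the target bundle
after the file name.  The sketch below builds such lines in Java.  It
assumes the tab-separated "bundle:" suffix described in the item importer
documentation linked above, so confirm the exact syntax for your DSpace
version.  The class, method, and file names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative helper (not part of DSpace) that builds 'contents' file lines. */
public class ContentsFileSketch {

    /**
     * Returns one 'contents' line per file name. Names ending in ".txt" get
     * an assumed tab-separated "bundle:TEXT" suffix; all other names are
     * listed bare, leaving the importer to use its default bundle.
     */
    public static List<String> contentsLines(List<String> fileNames) {
        List<String> lines = new ArrayList<String>();
        for (String name : fileNames) {
            if (name.toLowerCase().endsWith(".txt")) {
                lines.add(name + "\tbundle:TEXT");
            } else {
                lines.add(name);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical item directory holding a scanned PDF and its text file.
        for (String line : contentsLines(Arrays.asList("report.pdf", "report.txt"))) {
            System.out.println(line);
        }
    }
}
```

Only the .txt entry is redirected, so the PDF still lands wherever the
importer normally puts it.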


In my opinion (others may have differing opinions), it'd be safer & 
potentially easier to go with either option #1 or #3.  The danger of 
option #2 is that the "TEXT" bundle tends to be managed by the 
"filter-media" script in DSpace.  As long as you are always aware that 
you manually added files to this bundle, you should be fine.  But, if 
you ever ran 'filter-media' in "force" mode (with the -f option), 
there'd be a possibility the 'filter-media' script would overwrite all 
your manually added *.txt files in that bundle.
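
Bundle placement matters in all three options because of the rule Mark
quotes from DSIndexer further down in this thread: only bundles named
"TEXT" are indexed.  A minimal standalone sketch of that rule follows.  It
is not DSpace code; the class name, method name, map-based item model, and
file names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Standalone model of the "only the TEXT bundle is indexed" rule. */
public class TextBundleRuleSketch {

    /** Returns the file names that would be indexed: those in a bundle named "TEXT". */
    public static List<String> indexedFiles(Map<String, List<String>> bundles) {
        List<String> indexed = new ArrayList<String>();
        for (Map.Entry<String, List<String>> bundle : bundles.entrySet()) {
            String name = bundle.getKey();
            // Same shape as the quoted check: null-safe comparison with "TEXT".
            if (name != null && name.equals("TEXT")) {
                indexed.addAll(bundle.getValue());
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> item = new LinkedHashMap<String, List<String>>();
        item.put("ORIGINAL", Arrays.asList("report.pdf", "report.txt"));
        item.put("LICENSE", Arrays.asList("license"));
        System.out.println(indexedFiles(item)); // prints []

        // After the text lands in TEXT (by filtering or by direct import):
        item.put("TEXT", Arrays.asList("report.txt"));
        System.out.println(indexedFiles(item)); // prints [report.txt]
    }
}
```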

Hopefully that gives you a decent lay of the land.  There may be yet 
other options out there, but at least this gives you a few to work off
of.

- Tim



Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
> I ran the following query against the bundle table, and it seems we have
> only 3 bundle names in the table: LICENSE, ORIGINAL, and TEXT:
>
>   SELECT count(*), name
>     FROM bundle
>    GROUP BY name
>    ORDER BY name;
>
> All the .txt files we created in our 1000 document test are in the
> ORIGINAL bundle, according to NAME in the bundle table.  So if I run
> this query and then run index-all, these .txt files should be
> searchable, correct?
>
>   -- Caution: this renames the whole bundle, so every bitstream in it
>   -- (not only the .txt file) ends up in TEXT.
>   UPDATE bundle
>      SET name = 'TEXT'
>    WHERE bundle_id IN   -- IN, not =: the subquery can return many rows
>         (SELECT bu.bundle_id
>            FROM bitstream bi
>               , bundle2bitstream b2b
>               , bundle bu
>           WHERE bi.bitstream_id = b2b.bitstream_id
>             AND b2b.bundle_id   = bu.bundle_id
>             AND bundle.bundle_id = bu.bundle_id
>             AND bu.name = 'ORIGINAL'
>             AND bi.name LIKE '%.txt');
>
> Let me know what you think.
>
> Thanks again,
> Sue
>
> -----Original Message-----
> From: Tim Donohue [mailto:tdono...@illinois.edu]
> Sent: Tuesday, January 20, 2009 2:12 PM
> To: Diggory Mark
> Cc: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]; 
> dspace-tech@lists.sourceforge.net; Kimbrough, Glenn W. (LARC-B7)[NCI 
> INFORMATION SYSTEMS]; Warren, Douglas Lewis (LARC-B7)[NCI INFORMATION 
> SYSTEMS]; Smail, James W. (LARC-B702)[NCI INFORMATION SYSTEMS]
> Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
> 
>  
> 
> Mark,
>
> That's correct: the indexer only indexes files in the TEXT bundle.
> That's why I had recommended that Susan first run the 'filter-media'
> script.  The 'filter-media' script will take text files in the ORIGINAL
> bundle and essentially copy them over to the TEXT bundle for indexing.
>
> So, you are correct that the *.txt files could be put directly in the
> TEXT bundle (which would also avoid them being exposed publicly).  But,
> the alternative would be to put the *.txt files in the ORIGINAL bundle
> and run 'filter-media' to "filter" them into the TEXT bundle.  (However,
> as you noted, this latter option would require UI alteration to hide the
> *.txt files, if they shouldn't be accessible.)
>
> - Tim
>
> Diggory Mark wrote:
> 
>>  Actually...
>>
>>  Looking at the code of DSIndexer (written, I'm sure, by myself among
>>  others), we find that only Bitstreams within the "TEXT" bundle are
>>  actually indexed into Lucene:
>>
>> >  for (int i = 0; i < myBundles.length; i++)
>> >  {
>> >      if ((myBundles[i].getName() != null)
>> >              && myBundles[i].getName().equals("TEXT"))
>> >      {
>>
>>  I'm thinking this was short-sighted, but the unhappy consequence
>>  is that your text files will not get indexed if you place them
>>  into the "ORIGINAL" bundle.  There are two solutions:
>>
>>  A.) Put your text bitstreams into the TEXT bundle and not have to worry
>>  about them being exposed, because the TEXT bundle will not be.
>>
>>  B.) Put your text Bitstreams in the ORIGINAL bundle, alter the UI to
>>  hide them, and alter DSIndexer to index the ORIGINAL bundle.
>>
>>  Mark
>>
>>  On Jan 16, 2009, at 2:40 PM, Tim Donohue wrote:
>>
>> > Susan,
>> >
>> > Actually, the setting you'd want to change in your DSpace 1.4.2
>> > dspace.cfg is this one:
>> >
>> > plugin.sequence.org.dspace.app.mediafilter.MediaFilter = ...
>> >
>> > You'd want to remove the entry for:
>> > "org.dspace.app.mediafilter.PDFFilter"
>> >
>> > That'd ensure that the PDFFilter is no longer used by filter-media.  The
>> > setting that you referenced below just configures the PDF filter to
>> > process files which are "Adobe PDF" format.
>> >
>> > [NOTE:] If you end up upgrading to DSpace 1.5.x, the above
>> > "plugin.sequence.org.dspace.app.mediafilter.MediaFilter" setting no
>> > longer exists.  Instead, it was replaced by a simpler
>> > "filter.plugins" setting.  In that case, for DSpace 1.5.x, you'd just
>> > remove "PDF Text Extractor" from the list of enabled "filter.plugins".
>> > Again, this would ensure that 'filter-media' would no longer use the PDF
>> > filter.
>> >
>> > Hopefully that all makes sense.  Beyond that, as you mentioned, you'd
>> > just need to hide those '*.txt' files from being displayed.
>> >
>> > - Tim
>> >
>> > Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
> 
>> >> Hi Tim,
>> >>
>> >>     So you're saying that our proposed solution would work as long as
>> >> we remove (or comment out):
>> >>
>> >> filter.org.dspace.app.mediafilter.PDFFilter.inputFormats = Adobe PDF
>> >>
>> >> from dspace.cfg and make the change to not display the .txt files on the
>> >> Item pages?
>> >>
>> >> Then we would still need to run filter-media, which would basically
>> >> only add our .txt files to the TEXT bundle for each Item?
>> >>
>> >> By the way, we have been using the 1.5 version of filter-media, with the
>> >> addition of the two new configuration parameters in dspace.cfg, for
>> >> a while, even though we are running DSpace 1.4.2.  I did this a while
>> >> back and yes, it has stopped the Java heap space errors from killing
>> >> filter-media midstream.
>> >>
>> >> I do think this new plan is the better way to go for us.  I believe the
>> >> advantages would be:
>> >>
>> >> 1.  No more filter-media running for so long (over 24 hours most of
>> >> the time).
>> >>
>> >> 2.  We would identify "problematic" .pdf files (ones that possibly
>> >> wouldn't filter) prior to importing them into DSpace, instead of
>> >> after the fact.  When these problems are caught at the scanning point,
>> >> they could be dealt with there and then (rescanning, re-OCRing, etc.).
>> >>
>> >> 3.  Our Users wouldn't have such a big job of identifying the
>> >> "unfilterable" documents, locating them for rescanning, getting them
>> >> back to us for re-import, etc.
>> >>
>> >> 4.  The bottom line would be a more accurate full-text searchable
>> >> repository.
>> >>
>> >> Thanks a bunch for the detailed feedback.  We are processing a 1000
>> >> document test with this new procedure and will let you know how it
>> >> goes!!
>> >>
>> >> Sue
>> >>
>> >> -----Original Message-----
>> >> From: Tim Donohue [mailto:tdono...@illinois.edu]
>> >> Sent: Thursday, January 15, 2009 11:27 AM
>> >> To: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
>> >> Cc: dspace-tech@lists.sourceforge.net; Kimbrough, Glenn W. (LARC-B7)[NCI
>> >> INFORMATION SYSTEMS]; Warren, Douglas Lewis (LARC-B7)[NCI INFORMATION
>> >> SYSTEMS]; Smail, James W. (LARC-B702)[NCI INFORMATION SYSTEMS]
>> >> Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
>> >>
>> >> Sue,
>> >>
>> >> There were some improvements to 'filter-media' in DSpace 1.5.x.
>> >> Primarily, there's the addition of two new PDF-specific settings in
>> >> dspace.cfg:
>> >>
>> >> pdffilter.largepdfs = true
>> >> pdffilter.skiponmemoryexception = true
>> >>
>> >> The former ensures that all PDF text extractions are written to
>> >> temporary files during indexing.  This helps avoid the OutOfMemoryError
>> >> and heap space errors that were occasionally caused by larger PDFs being
>> >> loaded into system memory all at once.
>> >>
>> >> The latter attempts to skip over any PDFs which still cause an
>> >> OutOfMemoryError.  So, if that error still occurs on a PDF, then
>> >> the PDF is skipped entirely and *not* indexed.  This helps to avoid the
>> >> entire 'filter-media' script "crashing" when an OutOfMemoryError
>> >> occurs (which used to happen in 1.4.2).
>> >>
>> >> Despite these changes in 1.5.x, there is NO guarantee that *all* of your
>> >> PDFs will index properly.  As I've mentioned before, the 'filter-media'
>> >> script uses third-party software (called PDFBox: http://www.pdfbox.org/)
>> >> for indexing of PDF files.  There are some known bugs in PDFBox that
>> >> have yet to be fixed, so it does *not* always work for all PDFs.  In
>> >> some cases, PDFBox will also work inconsistently (and I don't know why
>> >> that is).  I've run into some inconsistency problems with larger-sized
>> >> PDFs, which are originally scanned documents with embedded OCR.
>> >> Occasionally PDFBox will index them fine, and other times it will cause
>> >> an OutOfMemoryError (which, with DSpace 1.5, means that
>> >> 'filter-media' will just skip that PDF).
>> >>
>> >> So, I guess the best way to sum this up is that DSpace currently cannot
>> >> successfully index 100% of all PDFs, since PDFBox cannot do so.  DSpace
>> >> 1.5 has improvements in helping DSpace to safely handle PDFBox issues
>> >> (like the OutOfMemoryError), but it doesn't necessarily have
>> >> drastic improvements in indexing capabilities.
>> >>
>> >> I answered your other questions inline below...
>> >>
>> >> Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
>> >>
>> >>> 1.  Has the filter-media/index-all process changed
>> >>> and/or improved significantly in DSpace 1.5?  If so, we may just shelve
>> >>> this issue until we've implemented 1.5.
>> >>
>> >> See above, obviously...
>> >>
>> >>> 2.  In DSpace 1.4.2 (and 1.5), does it matter whether
>> >>> your .txt files are plain or accessible .txt files?  Can index-all
>> >>> process either type?
>> >>
>> >> For text files, it doesn't really matter: in either case the
>> >> 'filter-media' script just pulls out the plain text for indexing.  I
>> >> don't believe there'd be any significant difference between the "type"
>> >> of .txt file.
>> >>
>> >> However, it's worth making this clear: for .txt files, you *still* need
>> >> to run the 'filter-media' script for them to be indexed by 'index-all'.
>> >> Essentially, 'index-all' only indexes plain text files in the "TEXT"
>> >> bundle.  The 'filter-media' script is what adds plain text to the "TEXT"
>> >> bundle.
>> >>
>> >>> 3.  If the process hasn't changed and/or improved significantly
>> >>> in 1.5, we are considering having our scanning
>> >>> folks just create the .txt files along with the .pdf files at the time
>> >>> the documents are scanned.  Then when they send them to us, we would
>> >>> just upload them in the import process along with the .pdf files for
>> >>> each Item.  The only thing we'd really have to change in our import
>> >>> process is the addition of a second file name in the "contents" file and
>> >>> the addition of the .txt document in the Item's import directory (right
>> >>> along with the .pdf file).  One other issue is we might have to make a
>> >>> small modification to DSpace to *not* display the .txt file on the
>> >>> Item page unless the User is in the Admin interface, since we wouldn't
>> >>> want our Users clicking on/opening the .txt files.  If we did this, we
>> >>> could completely eliminate the filter-media job altogether.  This would
>> >>> ensure that we did not load any "unfilterable" documents into DSpace.
>> >>> It would also eliminate the tedious process of identifying which
>> >>> documents did not filter successfully, and the whole process of
>> >>> rescanning and replacing them in DSpace.
>> >>
>> >> This sounds like a perfectly reasonable way of doing things, assuming
>> >> you have the staff time to pre-generate those .txt files.  You are
>> >> correct that you'd no longer need to run 'filter-media' on those PDFs.
>> >> But, you'd still need to run 'filter-media' to index those .txt files.
>> >> You could do this by modifying the "Media Filter" settings in your
>> >> dspace.cfg and *removing* the PDFFilter from the list (so 'filter-media'
>> >> would no longer filter PDFs, but it would work on the other types of
>> >> content).
>> >>
>> >> It would also require some custom coding to hide those .txt files from
>> >> normal users, but that shouldn't be too horrible.
>> >>
>> >> If you did go this route, I'd make sure that you still OCR the PDFs that
>> >> you put in, as it improves their accessibility overall.
>> >>
>> >> Hopefully that all makes sense.  Definitely let us know if you have
>> >> further questions.
>> >>
>> >> - Tim
>> >>
>>
> 
>>  ~~~~~~~~~~~~~
>>  Mark R. Diggory
>>  http://purl.org/net/mdiggory/homepage
>>
>  
> 

-- 
Tim Donohue
Research Programmer, IDEALS
http://www.ideals.uiuc.edu/
University of Illinois
tdono...@illinois.edu | (217) 333-4648

_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
