Sue,

There were some improvements to 'filter-media' in DSpace 1.5.x. 
Primarily, there's the addition of two new PDF-specific settings in the 
dspace.cfg:

pdffilter.largepdfs = true
pdffilter.skiponmemoryexception = true

The former ensures that all PDF text extractions are written to 
temporary files during indexing.  This helps avoid the OutOfMemoryError 
("Java heap space") failures that were occasionally caused by larger 
PDFs being loaded into system memory all at once.

The latter attempts to skip over any PDFs which still cause an 
OutOfMemoryError.  If that error still occurs on a PDF, the PDF is 
skipped entirely and *not* indexed.  This keeps the entire 
'filter-media' script from "crashing" when an OutOfMemoryError occurs 
(which used to happen in 1.4.2).

Despite these changes in 1.5.x, there is NO guarantee that *all* of your 
PDFs will index properly.  As I've mentioned before, the 'filter-media' 
script uses third-party software (called PDFBox: http://www.pdfbox.org/) 
to extract the text of PDF files for indexing.  There are some known 
bugs in PDFBox that have yet to be fixed, so it does *not* always work 
for all PDFs.  In some cases, PDFBox also behaves inconsistently (and I 
don't know why that is).  I've run into this with larger PDFs that were 
originally scanned documents with embedded OCR.  Sometimes PDFBox will 
index them fine, and other times it will cause an OutOfMemoryError 
(which, in DSpace 1.5 with 'pdffilter.skiponmemoryexception' enabled, 
means that 'filter-media' will just skip that PDF).

So, I guess the best way to sum this up is that DSpace currently cannot 
successfully index 100% of PDFs, since PDFBox cannot do so.  DSpace 1.5 
is better at safely handling PDFBox failures (like the out-of-memory 
errors), but it doesn't necessarily bring drastic improvements in 
indexing capabilities.

I answered your other questions inline below...


Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:

> 1. Has the filter-media/index-all process changed 
> and/or improved significantly in DSpace 1.5?  If so, we may just shelve 
> this issue until we’ve implemented 1.5.

See above, obviously...

> 2. In DSpace 1.4.2 (and 1.5), does it matter whether 
> your .txt files are plain or accessible .txt files?  Can index-all 
> process either type?

For text files, it doesn't really matter...in either case the 
'filter-media' script just pulls out the plain text for indexing.  I 
don't believe there'd be any significant difference between the "type" 
of .txt file.

However, it's worth making this clear: for .txt files, you *still* need 
to run the 'filter-media' script for them to be indexed by 'index-all'.  
Essentially, for full-text searching, 'index-all' only indexes the plain 
text files in the "TEXT" bundle, and the 'filter-media' script is what 
adds plain text to that bundle.

>  
> 
> 3. If the process in 1.5 hasn’t changed and/or 
> improved significantly in 1.5, we are considering having our scanning 
> folks just create the .txt files along with the .pdf files at the time 
> the documents are scanned.  Then when they send them to us, we would 
> just upload them in the import process along with the .pdf files for 
> each Item.  The only thing we’d really have to change in our import 
> process is the addition of a second file name in the “contents” file and 
> the addition of the .txt document in the Item’s import directory (right 
> along with the .pdf file).  One other issue is we might have to make a 
> small modification to DSpace to **not** display the .txt file on the 
> Item page unless the User is in the Admin interface since we wouldn’t 
> want our Users clicking on/opening the .txt files.  If we did this, we 
> could completely eliminate the filter-media job altogether.  This would 
> ensure that we did not load any “unfilterable” documents into DSpace.  
> It would also eliminate the tedious process of identifying which 
> documents did not filter successfully, and the whole process of 
> rescanning and replacing them in DSpace.

This sounds like a perfectly reasonable way of doing things, assuming 
you have the staff time to pre-generate those .txt files.  You are 
correct that you'd no longer need to run 'filter-media' on those PDFs.  
But you'd still need to run 'filter-media' on the .txt files, so that 
their text ends up in the "TEXT" bundle and gets indexed.  You could do 
this by modifying the "Media Filter" settings in your dspace.cfg and 
*removing* the PDFFilter from the list (so 'filter-media' would no 
longer filter PDFs, but it would still work on the other types of 
content).
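
For what it's worth, here is roughly what that change might look like.  
This is only a sketch: I'm assuming your dspace.cfg lists the enabled 
filters in a 'filter.plugins' property, and that the PDFFilter is 
registered there under the name "PDF Text Extractor".  Please check the 
actual property and filter names in your own file before changing 
anything.

```properties
# Assumed starting point (names may differ in your dspace.cfg):
# filter.plugins = PDF Text Extractor, HTML Text Extractor, \
#                  Word Text Extractor, JPEG Thumbnail

# After removing the entry assumed to map to PDFFilter, 'filter-media'
# should leave PDFs alone but still process the remaining formats.
filter.plugins = HTML Text Extractor, Word Text Extractor, JPEG Thumbnail
```

After a change like this, you'd re-run 'filter-media' and then 
'index-all' as usual.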

It would also require some custom coding to hide those .txt files from 
normal users, but that shouldn't be too horrible.

If you did go this route, I'd make sure that you still OCR the PDFs that 
you put in, as it improves their accessibility overall.

Hopefully that all makes sense...definitely let us know if you have 
further questions.

- Tim

-- 
Tim Donohue
Research Programmer, IDEALS
http://www.ideals.uiuc.edu/
University of Illinois
tdono...@illinois.edu | (217) 333-4648

_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech