Sue,

There were some improvements to 'filter-media' in DSpace 1.5.x. Primarily, there's the addition of two new PDF-specific settings in the dspace.cfg:

    pdffilter.largepdfs = true
    pdffilter.skiponmemoryexception = true

The former ensures that all PDF text-extractions are written to temporary files during indexing. This helps avoid the OutOfMemoryError ("Java heap space") failures that were occasionally caused by larger PDFs being loaded into system memory all at once.

The latter attempts to skip over any PDFs which still cause an OutOfMemoryError. So, if that error still occurs on a PDF, then the PDF is skipped entirely and *not* indexed. This helps to avoid the entire 'filter-media' script "crashing" when an OutOfMemoryError occurs (which used to happen in 1.4.2).

Despite these changes in 1.5.x, there is NO guarantee that *all* of your PDFs will index properly. As I've mentioned before, the 'filter-media' script uses third-party software (called PDFBox: http://www.pdfbox.org/) for indexing of PDF files. There are some known bugs in PDFBox that have yet to be fixed, so it does *not* always work for all PDFs.

In some cases, PDFBox will also work inconsistently (and I don't know why that is). I've run into some inconsistency problems with larger-sized PDFs which were originally scanned documents with embedded OCR. Occasionally PDFBox will index them fine, and other times it will cause an OutOfMemoryError (which, with DSpace 1.5, means that 'filter-media' will just skip that PDF).

So, I guess the best way to sum this up is that DSpace currently cannot successfully index 100% of all PDFs, since PDFBox cannot do so. DSpace 1.5 has improvements that help DSpace safely handle PDFBox issues (like the OutOfMemoryError), but it doesn't necessarily have drastic improvements in indexing capabilities.

I answered your other questions inline below...

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
> 1. Has the filter-media/index-all process changed
> and/or improved significantly in DSpace 1.5? If so, we may just shelve
> this issue until we’ve implemented 1.5.

See above, obviously...

> 2.
> In DSpace 1.4.2 (and 1.5), does it matter whether
> your .txt files are plain or accessible .txt files? Can index-all
> process either type?

For text files, it doesn't really matter...in either case the 'filter-media' script just pulls out the plain text for indexing. I don't believe there'd be any significant difference between the "type" of .txt file.

However, it's worth making this clear: for .txt files, you *still* need to run the 'filter-media' script for them to be indexed by 'index-all'. Essentially, 'index-all' only indexes plain text files in the "TEXT" bundle. The 'filter-media' script is what adds plain text to the "TEXT" bundle.

> 3. If the process in 1.5 hasn’t changed and/or
> improved significantly in 1.5, we are considering having our scanning
> folks just create the .txt files along with the .pdf files at the time
> the documents are scanned. Then when they send them to us, we would
> just upload them in the import process along with the .pdf files for
> each Item. The only thing we’d really have to change in our import
> process is the addition of a second file name in the “contents” file and
> the addition of the .txt document in the Item’s import directory (right
> along with the .pdf file). One other issue is we might have to make a
> small modification to DSpace to **not** display the .txt file on the
> Item page unless the User is in the Admin interface since we wouldn’t
> want our Users clicking on/opening the .txt files. If we did this, we
> could completely eliminate the filter-media job altogether. This would
> ensure that we did not load any “unfilterable” documents into DSpace.
> It would also eliminate the tedious process of identifying which
> documents did not filter successfully, and the whole process of
> rescanning and replacing them in DSpace.

This sounds like a perfectly reasonable way of doing things, assuming you have the staff time to pre-generate those .txt files.
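
Just to illustrate the import change described in the question, an Item's import directory might end up looking roughly like the sketch below. This assumes the usual item-import layout (one directory per Item, holding a dublin_core.xml, a "contents" file, and the files themselves); the directory and file names are made up for the example.

```text
item_001/                  <- one directory per Item (hypothetical name)
    dublin_core.xml
    contents
    report.pdf             <- the scanned PDF (hypothetical name)
    report.txt             <- the pre-generated text (hypothetical name)

The "contents" file would then list both files, one per line:

    report.pdf
    report.txt
```
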
You are correct that you'd no longer need to run 'filter-media' on those PDFs. But you'd still need to run 'filter-media' on those .txt files so that 'index-all' can index them. You could do this by modifying the "Media Filter" settings in your dspace.cfg and *removing* the PDFFilter from the list (so 'filter-media' would no longer filter PDFs, but it would still work on the other types of content).

It would also require some custom coding to hide those .txt files from normal users, but that shouldn't be too horrible. If you did go this route, I'd make sure that you still OCR the PDFs that you put in, as it improves their accessibility overall.

Hopefully that all makes sense...definitely let us know if you have further questions.

- Tim

--
Tim Donohue
Research Programmer, IDEALS
http://www.ideals.uiuc.edu/
University of Illinois
tdono...@illinois.edu | (217) 333-4648

_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech