Solved.

In v3.2, bitstreamformatregistry.short_description for mimetype
application/pdf is 'Adobe PDF'.  However, in my installation (for some
long-lost reason) the short_description is simply 'PDF'.

Therefore in MediaFilterManager.java::filterBitstream(), the test at line
556:

  if (fmts.contains(myBitstream.getFormat().getShortDescription()))

never returns true, so no PDF files are ever processed.

As a workaround, in dspace.cfg, I changed

  filter.org.dspace.app.mediafilter.PDFFilter.inputFormats = Adobe PDF

to

  filter.org.dspace.app.mediafilter.PDFFilter.inputFormats = Adobe PDF, PDF

and voila!  Everything works.  I could just as easily have updated
bitstreamformatregistry, but I was wary of breaking something else.
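
For anyone who hits the same thing, here is a minimal sketch of why the
check fails and why the extra config entry helps.  It is not the DSpace
code: the class name and both helper methods are made up for illustration,
and I am assuming the inputFormats value is split on commas and trimmed
before the comparison.

```java
import java.util.ArrayList;
import java.util.List;

public class FormatMatchSketch {

    // Hypothetical helper: split a comma-separated inputFormats value into
    // trimmed entries (an assumption about how the property is read).
    static List<String> parseInputFormats(String configValue) {
        List<String> formats = new ArrayList<>();
        for (String entry : configValue.split(",")) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                formats.add(trimmed);
            }
        }
        return formats;
    }

    // Same shape as the test quoted above: an exact, case-sensitive
    // membership check against the format's short description.
    static boolean wouldFilter(List<String> fmts, String shortDescription) {
        return fmts.contains(shortDescription);
    }

    public static void main(String[] args) {
        // short_description as stored in my installation's registry
        String registryValue = "PDF";

        List<String> before = parseInputFormats("Adobe PDF");
        List<String> after = parseInputFormats("Adobe PDF, PDF");

        System.out.println("before: " + wouldFilter(before, registryValue));
        System.out.println("after:  " + wouldFilter(after, registryValue));
    }
}
```

Because the comparison is an exact string match, 'PDF' is not found in a
list that holds only 'Adobe PDF'; listing both spellings lets either
registry value through.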

Cheers!
Bill


On Mon, Sep 23, 2013 at 11:06 AM, Bill Tantzen <[email protected]> wrote:

> Ivan,
>
> Thanks for checking in...
>
> dspace filter-media returns with exit status 0.  The dspace log shows no
> errors, just entries of the form:
>
> 2013-09-23 10:37:41,012 INFO  org.dspace.search.DSIndexer @ Writing
> Community: 2408/104859 to Index
>
> or:
>
> 2013-09-23 10:37:40,336 INFO  org.dspace.search.DSIndexer @ Writing
> Collection: 2408/55874 to Index
>
> The output from the command line is short.  Normally, I would expect to
> see a log of each bitstream examined beginning with 'FILTERED' or
> 'SKIPPED'.  Instead I see only a few errors for .doc files (Invalid Format)
> followed by a couple of SKIPPED entries for bitstreams with an existing
> .txt file.
>
> All the .pdf files are in the ORIGINAL bundle.  For instance:
>
> dspace=> select * from item2bundle where item_id = 34950;
> -[ RECORD 1 ]----
> id        | 39982
> item_id   | 34950
> bundle_id | 39983
> -[ RECORD 2 ]----
> id        | 39983
> item_id   | 34950
> bundle_id | 39984
>
> dspace=> select * from bundle where bundle_id in ( 39983, 39984 );
> -[ RECORD 1 ]--------+---------
> bundle_id            | 39983
> name                 | LICENSE
> primary_bitstream_id |
> -[ RECORD 2 ]--------+---------
> bundle_id            | 39984
> name                 | ORIGINAL
> primary_bitstream_id |
>
> dspace=> select * from bundle2bitstream where bundle_id = 39984;
> -[ RECORD 1 ]---+------
> id              | 40042
> bundle_id       | 39984
> bitstream_id    | 40065
> bitstream_order | 2
>
> dspace=> select * from bitstream where bitstream_id = 40065;
> -[ RECORD 1 ]-----------+------------------------------------------------
> bitstream_id            | 40065
> bitstream_format_id     | 3
> name                    | 8175706.pdf
> size_bytes              | 6587102
> checksum                | 164de17195af1d0de45cd17a431fc2b9
> checksum_algorithm      | MD5
> description             |
> user_format_description |
> source                  | /dspace/assetstore/dspace-sr/upload/8175706.pdf
> internal_id             | 104968051252620967298398595849898250327
> deleted                 | f
> store_number            | 0
> sequence_id             | 2
>
> This bitstream however is neither FILTERED nor SKIPPED.
>
> This database has been recently updated from v1.42 to v3, and I suspect
> the problem is somewhere in the db rather than a bug in the code, but
> everything *looks* right to me.  I can trace the relations from the
> community to collection to item, but for some reason the bitstreams are
> simply not checked.
>
> What do you think?
> Bill
>
>
> On Sun, Sep 22, 2013 at 12:35 PM, helix84 <[email protected]> wrote:
>
>> Hi Bill, please remember to keep dspace-tech in CC.
>>
>> Can you please tell me what the result of each of my suggestions was?
>> 1) What was the errorlevel of your filter-media command?
>> 2) Did you look at the log while it was running using "tail -f"?
>> 3) Were all the bitstreams you expected to be filtered in the ORIGINAL
>> bundle? (check at least a few)
>>
>>
>> On Fri, Sep 20, 2013 at 10:09 PM, Bill Tantzen <[email protected]> wrote:
>> > Hi Ivan!
>> >
>> > I've tried all these suggestions, and still, no success.
>> >
>> > There are no errors in the log, only entries of the form:
>> >
>> > 2013-09-20 15:00:24,802 INFO  org.dspace.search.DSIndexer @ Writing
>> > Community: 2408/36293 to Index
>> >
>> > And
>> >
>> > 2013-09-20 15:00:17,990 INFO  org.dspace.search.DSIndexer @ Writing
>> > Collection: 2408/35292 to Index
>> >
>> > One for each community and collection.  The bundles are ORIGINAL,
>> > nothing special here...
>> >
>> > The database seems OK, I am able to follow the communities to
>> > collections to items just fine, but no bitstreams are being filtered.
>> >
>> > I'll keep debugging on my end, but if you have any other ideas, do pass
>> > them my way!
>> > Bill
>> >
>> >
>> > On Thu, Sep 19, 2013 at 9:08 AM, helix84 <[email protected]> wrote:
>> >>
>> >> Hi Bill,
>> >>
>> >> Jose's suggestion to look at the logs for errors is a good one. First
>> >> of all, we should determine whether the filtering failed during
>> >> processing some item or whether it completed with nothing else to
>> >> process.
>> >>
>> >> Also check the errorlevel of the command. 1 means error, 0 means
>> >> success.
>> >>
>> >>
>> >> On Thu, Sep 19, 2013 at 3:03 PM, Bill Tantzen <[email protected]> wrote:
>> >> > Still working on this media filter issue -- maybe this might point me
>> >> > in the right direction:  how are bitstreams selected for filtering?
>> >> > Is it something like SELECT * FROM bitstream WHERE ???
>> >> > What is in the WHERE clause?  Or is there some other basis for
>> >> > selection?
>> >>
>> >> No, it's not SQL. It's a recursive call down the hierarchy, as you can
>> >> see in this method and the few following it: [1]
>> >>
>> >> However your WHERE suggestion got me thinking which bitstreams are
>> >> being processed and the answer is bitstreams in the ORIGINAL bundle.
>> >> So please check that your content bundles are called ORIGINAL and not
>> >> something else (e.g. THUMBNAIL or something custom).
>> >>
>> >> [1] https://github.com/DSpace/DSpace/blob/dspace-3.2/dspace-api/src/main/java/org/dspace/app/mediafilter/MediaFilterManager.java#L393
>> >> [2] https://github.com/DSpace/DSpace/blob/dspace-3.2/dspace-api/src/main/java/org/dspace/app/mediafilter/MediaFilterManager.java#L502
>> >>
>> >> Regards,
>> >> ~~helix84
>> >>
>> >> Compulsory reading: DSpace Mailing List Etiquette
>> >> https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette
>> >
>> >
>>
>>
>>
>> Regards,
>> ~~helix84
>>
>> Compulsory reading: DSpace Mailing List Etiquette
>> https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette
>>
>
>
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
List Etiquette: https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette
