Re: [oae-dev] Moving the preview processor to java

Erik Froese Fri, 17 Aug 2012 06:59:38 -0700

I wrote up how to run the Java PP as an OSGi service and as a standalone jar.
Let me know if you need anything to get it up and running.


https://confluence.sakaiproject.org/display/KERNDOC/Using+the+Java+Preview+Processor

Erik

On Mon, Aug 13, 2012 at 11:53 AM, Erik Froese <erik.fro...@gmail.com> wrote:
> We have a sprint planning meeting at rSmart today. I'll put it on the
> agenda for my work.
>
> I'm going on vacation 8/24 - 9/4 though so we'll have to work it out
> before then.
>
> Erik
>
> On Mon, Aug 13, 2012 at 11:36 AM, Nicolaas Matthijs
> <nicolaas.matth...@caret.cam.ac.uk> wrote:
>> Hi Erik,
>>
>> How about trying to turn this work into an official sprint deliverable
>> (co-ordinated with Kent)? Do you think you can free up some time for the
>> next sprint (August 16 - August 30) and will that give us enough time to do
>> remaining implementation, some testing and bug fixing?
>>
>> Thanks,
>> Nicolaas
>>
>>
>>
>>
>> On 25 Jul 2012, at 00:18, Erik Froese wrote:
>>
>>> I'm happy to let the Java PP testing slide to 1.5.0
>>>
>>> There are some recent improvements in the ruby PP that I need to
>>> implement.
>>> * sakaidocs - (easy, call out to wkhtmltopdf)
>>> * image previews in the same format as the original
>>>
>>> Erik
>>>
>>> On Tue, Jul 24, 2012 at 10:18 AM, Kent Fitzgerald <kentf...@umich.edu>
>>> wrote:
>>>>
>>>> Several questions/comments.
>>>> There has already been a 1.4.1 release proposed for immediately following
>>>> 1.4.0 that would be isolated to code reformatting. Which would take
>>>> precedence?
>>>>
>>>> We should definitely do a bug bash. One of the dangers of doing a bug bash
>>>> focused on the preview processor is that we'll likely have people uploading
>>>> hundreds of files each. Subjectively, this could give the impression of
>>>> decreased performance just because we're hitting it much harder.
>>>>
>>>> More importantly, in addition to the bug bash, we need to do controlled
>>>> tests on processing time on different data types. I'd like to break it down
>>>> by file type and have truly controlled tests. In addition to different file
>>>> types, we'll need files of varying sizes to compare performance not just on
>>>> quantity but on complexity. This needs to be compared to the performance of
>>>> the current implementation.
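
A controlled comparison along these lines could be organized as a small timing
harness. The following is only a sketch under stated assumptions:
PreviewTimingSketch, Sample, and the processor callback are hypothetical names,
and the callback stands in for whichever preview processor (Ruby or Java) is
being measured. It groups elapsed times by file type so that runs over the same
set of files can be compared.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

// Hypothetical sketch; none of these names come from the OAE code base.
class PreviewTimingSketch {

    /** One test file: its type (e.g. "pdf") and its size in bytes. */
    static final class Sample {
        final String fileType;
        final long sizeBytes; // kept so results can later be broken down by size
        Sample(String fileType, long sizeBytes) {
            this.fileType = fileType;
            this.sizeBytes = sizeBytes;
        }
    }

    /** Runs the processor once per sample and records elapsed nanoseconds per file type. */
    static Map<String, List<Long>> timeByType(List<Sample> samples, Consumer<Sample> processor) {
        Map<String, List<Long>> timings = new TreeMap<>();
        for (Sample s : samples) {
            long start = System.nanoTime();
            processor.accept(s); // stand-in for generating one preview
            long elapsed = System.nanoTime() - start;
            timings.computeIfAbsent(s.fileType, k -> new ArrayList<>()).add(elapsed);
        }
        return timings;
    }

    /** Mean of a list of timings; returns 0 for an empty list. */
    static double mean(List<Long> values) {
        if (values.isEmpty()) return 0.0;
        long sum = 0;
        for (long v : values) sum += v;
        return (double) sum / values.size();
    }

    public static void main(String[] args) {
        List<Sample> samples = new ArrayList<>();
        samples.add(new Sample("pdf", 50_000));
        samples.add(new Sample("pdf", 5_000_000));
        samples.add(new Sample("png", 200_000));
        // Dummy processor: it does no real work, so the numbers are meaningless here.
        Map<String, List<Long>> timings = timeByType(samples, s -> { });
        for (Map.Entry<String, List<Long>> e : timings.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().size()
                    + " files, mean " + mean(e.getValue()) + " ns");
        }
    }
}
```

Running the same sample list through both implementations, on the same machine,
would give per-type figures that can be set side by side.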
>>>>
>>>> I think we all agree that this is an important feature that we shouldn't
>>>> try to rush out the door.
>>>>
>>>> I have to read back through the thread, but is there set-up documentation?
>>>> Currently we have a section on the OAE Configuration and Deployment page [1]
>>>> for the preview processor. It contains multiple supporting external links
>>>> that have proven confusing for many people trying to get the preview
>>>> processor running locally. We'll need to make sure we have adequate
>>>> documentation.
>>>>
>>>> As a side note, I will be out of the office starting this Friday through
>>>> next week.
>>>>
>>>>
>>>> [1] https://confluence.sakaiproject.org/display/3AK/OAE+Configuration+and+Deployment
>>>>
>>>>
>>>>
>>>> --
>>>> Kent Fitzgerald
>>>>
>>>> On Tuesday, July 24, 2012 at 9:51 AM, Nicolaas Matthijs wrote:
>>>>
>>>> Looks like this has been hanging around on list for a while now, and we
>>>> should probably try to move it forwards.
>>>>
>>>> The maintainability criterion can only be determined by a code review, which
>>>> is standard practice. However, as this is proving to be such a critical
>>>> feature in production, I'd suggest that we do a separate bugbash to evaluate
>>>> its performance, ease of setup (running from a separate machine) and most
>>>> importantly functional equivalence.
>>>>
>>>> When doing this, Kent can give his assessment of the ease of setup and the
>>>> bugbashers can determine functional equivalence. We should also try to have
>>>> it re-process the dummy content we usually bugbash with.
>>>>
>>>> If this all sounds good, I'd like to go ahead with this as soon as possible
>>>> and run a bugbash straight after the 1.4.0 release with all of this set up.
>>>> If the implementation survives the bugbash, it can be reviewed and merged.
>>>>
>>>> Does that sound reasonable?
>>>>
>>>> Thanks,
>>>> Nicolaas
>>>>
>>>>
>>>>
>>>> On 23 Jul 2012, at 07:42, Carl Hall wrote:
>>>>
>>>> Lance, I think the work is already split the way you suggest given what I
>>>> know about what Erik has done (rewrite in Java) and what's left (add JMS).
>>>> Adding message queue capabilities should not hold back reviewing the
>>>> proposed changes.
>>>>
>>>> I would say that it needs to meet these opening criteria for my general
>>>> acceptance:
>>>>
>>>> * Be functionally equal with the current solution
>>>> * A combination of performance and maintainability
>>>>   * Performance can be no worse overall. There might be different hotspots
>>>>     in the java version than the current ruby solution but there shouldn't
>>>>     be anything exponentially worse. Overall, the java version has to
>>>>     perform at least as well and hopefully better. Memory usage, overall
>>>>     processing time, resource usage (iops, disc reads, caching) should all
>>>>     be considered.
>>>>   * Be more maintainable than the Ruby solution. The current code has had
>>>>     very little cleaning and is not very readable. This includes using
>>>>     externally available libraries where possible. We shouldn't be
>>>>     maintaining plumbing not inherent to our domain.
>>>> * Easier to set up. Though our current setup for the ruby PP is known to be
>>>>   problematic, we at least are accustomed to it. The proposed solution has
>>>>   got to be more straightforward and less fragile.
>>>>
>>>> The numbers I've seen from some preliminary testing showed that the Java
>>>> impl took exponentially *less* time to process pdfs and was faster than the
>>>> ruby PP in every test. It's an OSGi bundle and written in Java like the rest
>>>> of our project, which makes it easier to set up and maintain as we write far
>>>> more java code than ruby. I believe there's also already a setup available
>>>> to run the java PP as a standalone server.
>>>> The Java version introduces a topia term extractor bundle which is a port
>>>> from the Python version. This is a point of maintenance to consider but the
>>>> python code has not changed in years. It's a common impl for other languages
>>>> to port but there wasn't a java version around. I would like to see this
>>>> code find a permanent home in a relevant OSS project. At the very least it
>>>> should be maintained apart from OAE core to make it available to a broader
>>>> audience.
>>>>
>>>> +1 to getting this code wrapped up and reviewed.
>>>>
>>>> On Wed, Jul 18, 2012 at 1:51 PM, Christian Vuerings
>>>> <vueringschrist...@gmail.com> wrote:
>>>>
>>>> I'm not sure whether this is already part of the criteria list or not, but
>>>> what about CPU/Memory usage?
>>>> Is there a way we can measure that and compare it to the current ruby based
>>>> PP?
>>>> When I currently run the ruby PP locally, it's usually one of the processes
>>>> that uses the most resources.
>>>>
>>>> One other thing I'm curious about is how well it will compress/handle the
>>>> different file formats (png/jpg/gif/psd)
>>>>
>>>> These are just 2 things that I'm interested in since they (can) have an
>>>> impact on the overall performance.
>>>>
>>>>
>>>> - Christian
>>>>
>>>> On Jul 18, 2012, at 12:41 PM, Lance Speelmon wrote:
>>>>
>>>> Does anyone have an opinion about adopting the new java based PP?
>>>> Specifically can you articulate acceptance criteria for such an adoption?
>>>> e.g.
>>>>
>>>> * Must support same preview behaviors as existing ruby-based PP.
>>>> * Must pass QA with all blocker and critical items resolved.
>>>> * Must start automatically OOTB to support the tire-kicking, web-start uses.
>>>> * Must leverage as much 3rd party code as possible to minimize ownership
>>>>   costs.
>>>> * Must pass code review.
>>>> * Unit test code coverage.
>>>> * Basic config and deployment documentation.
>>>>
>>>>
>>>> What is missing?  Anything?  Thanks, L
>>>>
>>>>
>>>>
>>>> On Jul 17, 2012, at 2:58 PM, Lance Speelmon <la...@rsmart.com> wrote:
>>>>
>>>> Is there any way to break this work down into chunks?  e.g.
>>>>
>>>> 1. Adopt java PP as default PP moving forward. What are the acceptance
>>>> criteria?
>>>> 2. Enhance new java PP with message queue abilities.
>>>>
>>>> WDYT?  Thanks, L
>>>>
>>>> On Jul 17, 2012, at 8:34 AM, Carl Hall <c...@hallwaytech.com> wrote:
>>>>
>>>> Each app server could run its own queues but that wouldn't support building
>>>> a farm of PP processors unless we also teach them to talk to multiple JMS
>>>> servers. Maybe something like DNS round-robin would suffice?
>>>>
>>>> On Tue, Jul 17, 2012 at 8:25 AM, Erik Froese <erik.fro...@gmail.com>
>>>> wrote:
>>>>
>>>> Do we need to cluster activemq? Can't each app server service its own
>>>> queues?
>>>> Erik
>>>>
>>>> On Tue, Jul 17, 2012 at 11:23 AM, Carl Hall <c...@hallwaytech.com> wrote:
>>>>>
>>>>> What Erik describes has been on the dev wish list for a little while now.
>>>>> Moving to an event-driven model would allow us to build out concurrency but
>>>>> there also comes the question of clustering ActiveMQ.
>>>>>
>>>>>
>>>>> On Thu, Jul 12, 2012 at 6:27 AM, Erik Froese <erik.fro...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Hey David,
>>>>>>
>>>>>> The code is not clustered.
>>>>>>
>>>>>> You'd need to write an event listener that would fire when new content
>>>>>> is uploaded. It would put the content ids on a JMS queue. Then
>>>>>> implement a ContentFetcher that grabs a message off of the queue and
>>>>>> wire that into the PPI. Events and Messages are not clustered in OAE
>>>>>> (AFAIK) so this would have to be run on each app server.
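
The listener-plus-fetcher arrangement described above can be sketched roughly
as follows. This is a minimal illustration, not the actual OAE or preview
processor code: an in-memory BlockingQueue stands in for the JMS queue, and
UploadListener, QueueContentFetcher, the ContentFetcher method signature, and
processAll are all hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: an in-memory queue stands in for a JMS queue.
class PreviewQueueSketch {

    /** Assumed shape of a fetcher: next content id to process, or null if none is waiting. */
    interface ContentFetcher {
        String nextContentId();
    }

    /** Stand-in for the upload event listener: enqueues the id of newly uploaded content. */
    static final class UploadListener {
        private final BlockingQueue<String> queue;
        UploadListener(BlockingQueue<String> queue) { this.queue = queue; }
        void onContentUploaded(String contentId) { queue.add(contentId); }
    }

    /** Fetcher that takes one message off the queue without blocking. */
    static final class QueueContentFetcher implements ContentFetcher {
        private final BlockingQueue<String> queue;
        QueueContentFetcher(BlockingQueue<String> queue) { this.queue = queue; }
        @Override
        public String nextContentId() {
            return queue.poll(); // null when the queue is empty
        }
    }

    /** Drains the fetcher and returns the ids in the order they were handled. */
    static List<String> processAll(ContentFetcher fetcher) {
        List<String> processed = new ArrayList<>();
        String id;
        while ((id = fetcher.nextContentId()) != null) {
            processed.add(id); // a real processor would generate the preview here
        }
        return processed;
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        UploadListener listener = new UploadListener(queue);
        listener.onContentUploaded("content-1");
        listener.onContentUploaded("content-2");
        System.out.println(processAll(new QueueContentFetcher(queue)));
    }
}
```

Because the queue here lives inside one JVM, the sketch has the same limitation
noted above: each app server would only see its own uploads.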
>>>>>>
>>>>>> While we're in event-land it'd be nice to be able to regenerate a
>>>>>> preview when a content body is updated. I'm not sure if this is
>>>>>> possible yet.
>>>>>>
>>>>>> I'm not sure how we'd limit the CPU usage yet either. You could manage
>>>>>> the quartz schedule that runs the PPI.
>>>>>>
>>>>>> We can also disable concurrent executions of the job.
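
As an illustration of what disabling concurrent executions means in practice,
here is a small sketch that uses a plain AtomicBoolean guard. It is not Quartz's
own mechanism and not the PPI code; NonConcurrentJobSketch and runIfIdle are
hypothetical names. A trigger that fires while the previous run is still in
progress is simply skipped.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: a plain flag, not Quartz's own mechanism.
class NonConcurrentJobSketch {
    private final AtomicBoolean running = new AtomicBoolean(false);

    /**
     * Runs the job unless a previous run is still in progress.
     * Returns true if the job ran, false if it was skipped.
     */
    boolean runIfIdle(Runnable job) {
        // compareAndSet flips the flag only if no other run currently holds it.
        if (!running.compareAndSet(false, true)) {
            return false;
        }
        try {
            job.run();
            return true;
        } finally {
            running.set(false); // release the guard even if the job throws
        }
    }

    public static void main(String[] args) {
        NonConcurrentJobSketch guard = new NonConcurrentJobSketch();
        boolean ran = guard.runIfIdle(() -> {
            // Simulates a trigger firing while the job is still running.
            boolean nested = guard.runIfIdle(() -> { });
            System.out.println("overlapping run started: " + nested);
        });
        System.out.println("first run started: " + ran);
    }
}
```

Skipping overlapping runs caps the processor at one job at a time, which bounds
its load on the app server without limiting CPU directly.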
>>>>>>
>>>>>> Erik
>>>>>>
>>>>>> On Wed, Jul 11, 2012 at 8:44 PM, Roma, David <dr...@csu.edu.au> wrote:
>>>>>>>
>>>>>>> Awesome news Erik!
>>>>>>>
>>>>>>> Our Ops guys will be stoked when we can get this in. A couple of
>>>>>>> questions from someone who hasn't looked at the code or read too
>>>>>>> deeply....
>>>>>>> - Does it support clustering?
>>>>>>>   - e.g. can we just run it side by side on each of our app servers
>>>>>>>     and they will play nice sharing out processing jobs?
>>>>>>>   - Will it affect performance of the app servers much? Can we limit
>>>>>>>     the preview processor to say 10% cpu and 500mb ram or low priority
>>>>>>>     threads or limit the number of items to process or something? This
>>>>>>>     would make for a nice simple deployment that wouldn't threaten the
>>>>>>>     app server stability.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Dave.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: oae-dev-boun...@collab.sakaiproject.org
>>>>>>> [mailto:oae-dev-boun...@collab.sakaiproject.org] On Behalf Of Erik
>>>>>>> Froese
>>>>>>> Sent: Thursday, 12 July 2012 2:37 AM
>>>>>>> To: Carl Hall
>>>>>>> Cc: oae-dev@collab.sakaiproject.org; Clay Fenlason
>>>>>>> Subject: Re: [oae-dev] Moving the preview processor to java
>>>>>>>
>>>>>>> Hey everyone,
>>>>>>>
>>>>>>> It's been a few months but I actually implemented the Java preview
>>>>>>> processor as an OSGi bundle. I filed a ticket for it [1].
>>>>>>>
>>>>>>> I'm not sure where to go from here. Is this something that could be
>>>>>>> included POST 1.4.0?
>>>>>>> Should I open a PR so we can review the code? If so, PR against which
>>>>>>> branch?
>>>>>>>
>>>>>>> Either way, have a look, give it a go. We'll probably wind up using it
>>>>>>> at rSmart.
>>>>>>>
>>>>>>> Erik
>>>>>>>
>>>>>>> [1] https://jira.sakaiproject.org/browse/KERN-3021
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Apr 17, 2012 at 9:09 AM, Carl Hall <c...@hallwaytech.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I totally agree that we should ally ourselves with other communities. I
>>>>>>>> see where we get docsplit from DocumentCloud[1] and we use several other
>>>>>>>> libraries for processing that they've most likely contributed to.
>>>>>>>> The Java approach is very little custom code compared to the libraries
>>>>>>>> we're getting from Apache (tika, sanselan, commons, pdfbox), so we would
>>>>>>>> still be building on the shoulders of our friendly community giants.
>>>>>>>>
>>>>>>>> [1] https://github.com/documentcloud/docsplit
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Apr 14, 2012 at 5:43 AM, John Norman <j...@caret.cam.ac.uk>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> My recollection (perhaps wrong) is that we got this from Document Cloud
>>>>>>>>> and I /think/ Chris Roby found it. Document Cloud seems a very relevant
>>>>>>>>> and valuable project. If we were able to help them while helping
>>>>>>>>> ourselves, other good things could come from the relationship. My
>>>>>>>>> general point is that we are thin on resources and so, in principle,
>>>>>>>>> symbiotic relationships are helpful.
>>>>>>>>>
>>>>>>>>> http://www.documentcloud.org/home
>>>>>>>>>
>>>>>>>>> John
>>>>>>>>>
>>>>>>>>> On 13 Apr 2012, at 17:03, Carl Hall <c...@hallwaytech.com> wrote:
>>>>>>>>>
>>>>>>>>> I agree with Daniel that our modifications to the preview processor have
>>>>>>>>> put its ownership square on us. Was there a community that this script
>>>>>>>>> was borrowed from? I thought it was original development that uses
>>>>>>>>> various external libraries to do the actual work. This is the approach
>>>>>>>>> that Erik is taking with the rewrite using things like Tika (text
>>>>>>>>> extraction), Sanselan (images) and a Java port of the python
>>>>>>>>> topia.termextract library.
>>>>>>>>>
>>>>>>>>> I certainly don't deny the speed of development that was realized in
>>>>>>>>> creating the PP but the current state of the code is a mess at best.
>>>>>>>>> Reuse of libraries in Java is showing a fast rewrite with very little
>>>>>>>>> managed code on our part.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Apr 13, 2012 at 12:50 AM, Daniel Parry
>>>>>>>>> <dan...@caret.cam.ac.uk>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 12, 2012 at 04:21:36PM -0400, Clay Fenlason wrote:
>>>>>>>>>>>
>>>>>>>>>>> I think this response is at best orthogonal to the point John's trying
>>>>>>>>>>> to raise, though I gather this kind of reaction must come from a
>>>>>>>>>>> buildup of some real frustration around the PP, which I don't mean to
>>>>>>>>>>> discount. I also think John was pretty clear about what he was
>>>>>>>>>>> suggesting: that there be a conversation with the community we got the
>>>>>>>>>>> PP from, if the conversation hasn't happened already, to see if there
>>>>>>>>>>> might still be a way to work together before we decide to just own it
>>>>>>>>>>> ourselves.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'd suggest the way that the preview processor was being extended
>>>>>>>>>> (initially a python server add on, followed by a ruby rewrite for tag
>>>>>>>>>> extraction) and the variety of ruby versions that deployers were using
>>>>>>>>>> and the methods used to deploy it were indicative of a) the OAE
>>>>>>>>>> community already 'owning' the PP and b) as has already been pointed
>>>>>>>>>> out some standardization needed restoring and additional functionality
>>>>>>>>>> added for deployers.  Hence, the list was pinged[0] a while back to ask
>>>>>>>>>> about standardizing and extending in java. I'm not sure of any other
>>>>>>>>>> way to contact the original PP community or if such a community
>>>>>>>>>> separate to OAE even still exists?
>>>>>>>>>>
>>>>>>>>>> Best wishes,
>>>>>>>>>>
>>>>>>>>>> Daniel
>>>>>>>>>>
>>>>>>>>>> [0] http://collab.sakaiproject.org/pipermail/oae-dev/2012-April/001677.html
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> --| Daniel Parry: dan...@caret.cam.ac.uk. www.caret.cam.ac.uk/ |--
>>>>>>>>>> "Of all the things a leader should fear, complacency should
>>>>>>>>>> head the list." [John C. Maxwell]
>>>>>>>>>> _______________________________________________
>>>>>>>>>> oae-dev mailing list
>>>>>>>>>> oae-dev@collab.sakaiproject.org
>>>>>>>>>> http://collab.sakaiproject.org/mailman/listinfo/oae-dev
