Re: Tika on Jenkins?
+1 to using travis-ci.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-283, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: Jukka Zitting <jukka.zitt...@gmail.com>
Reply-To: dev@tika.apache.org
Date: Wednesday, January 29, 2014 1:06 PM
To: Tika Development <dev@tika.apache.org>
Subject: Re: Tika on Jenkins?

Hi,

On Tue, Jan 28, 2014 at 10:37 AM, Allison, Timothy B. <talli...@mitre.org> wrote:
> How do we fix the Tika build in Jenkins?

I've kind of lost hope with ASF's Jenkins installation, it's been broken in various ways for years now. I used to participate in administering the service, but I guess it's gotten way too complex nowadays for part-time volunteers to manage.

Unless one of us wants to step up and get their hands dirty fixing Jenkins issues, I'd probably opt to disable the Jenkins build entirely and instead use something like Travis (https://travis-ci.org/) for our CI builds.

BR,

Jukka Zitting
Re: Extract thumbnail from openxml office files
Hi Hong-Thai,

+1 to using cardinality to help denote more complex metadata relationships, at least until we get past the prior discussions on Metadata and namespacing. See the wiki for some earlier thoughts:

http://wiki.apache.org/tika/MetadataDiscussion

I know our met structure is simple. It was purposefully designed that way: at the time, very complex and hierarchical metadata structures existed and could have been leveraged, but we favored a simple approach instead, i.e., key -> multi-value (note the distinction from key -> value).

Thanks!

Cheers,
Chris

-----Original Message-----
From: Hong-Thai Nguyen <hong-thai.ngu...@polyspot.com>
Reply-To: dev@tika.apache.org
Date: Thursday, January 9, 2014 8:36 AM
To: dev@tika.apache.org
Subject: RE: Extract thumbnail from openxml office files

Hi Nick,

You're beginning a very interesting topic about the foundation of our metadata concept :) I agree with you that metadata is not the best place to store the thumbnail result. Until now, our metadata has been a simple map of key:values. This structure is not really flexible in some cases. For example, suppose we want to store author information, where each author has a first name and a last name. Ideally, we could have something like a struct:

Person:
    FirstName
    LastName

Another example is our future thumbnail. We could have a 'thumbnail' metadata entry with a hierarchical structure like:

Thumbnail:
    Dimension
        Width
        Length
    MimeType
    Extension
    Pages
    Description

That needs a huge refactoring of our core model. Another solution is to keep the thumbnail result as a list, List<byte[]>, instead of a single value. Each element is the thumbnail of one page. If the list has only one element, it means there is only a thumbnail of the first page.

Hong-Thai

-----Original Message-----
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Thursday, January 9, 2014 12:11
To: dev@tika.apache.org
Subject: RE: Extract thumbnail from openxml office files

On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote:
> By searching on issues, I found the issue already created:
> https://issues.apache.org/jira/browse/TIKA-90

I'm not sure if the metadata is the right place to return this. Some formats offer a small thumbnail, others can offer a small thumbnail for every page, and at least one can include a full-size image of the first page.

Would we not be better off exposing these embedded renderings via the existing embedded resources handling, with some sort of handy way to identify what something is (e.g. "this is a full-size PNG of page 1", "this is a JPG thumbnail of page 3")?

Nick
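The two ideas in this thread (a per-page list of thumbnails, and a way to say what each rendering is) can be sketched together in plain Java. This is only an illustration under stated assumptions: `PageRendering` and `RenderingList` are hypothetical names invented here, not Tika classes, and nothing below reflects how Tika actually exposes embedded resources.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical holder (not part of Tika): one rendering of one page.
final class PageRendering {
    final int page;          // 1-based page number
    final String mimeType;   // e.g. "image/png"
    final boolean fullSize;  // true for a full-size page image, false for a thumbnail
    final byte[] data;       // raw image bytes

    PageRendering(int page, String mimeType, boolean fullSize, byte[] data) {
        this.page = page;
        this.mimeType = mimeType;
        this.fullSize = fullSize;
        this.data = data.clone(); // defensive copy
    }

    // Human-readable label, in the spirit of "this is a full-size PNG of page 1".
    String describe() {
        return (fullSize ? "full-size " : "thumbnail ") + mimeType + " of page " + page;
    }
}

// Hypothetical container (not part of Tika): the list-of-renderings idea.
final class RenderingList {
    private final List<PageRendering> renderings = new ArrayList<>();

    void add(PageRendering rendering) {
        renderings.add(rendering);
    }

    List<PageRendering> all() {
        return Collections.unmodifiableList(renderings);
    }

    // The single-element case from the thread: only the first page has a rendering.
    boolean firstPageOnly() {
        return renderings.size() == 1 && renderings.get(0).page == 1;
    }
}
```

Carrying the page number and MIME type next to the bytes keeps the flat-list approach while still letting a consumer tell a page-1 full-size image from a page-3 thumbnail.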
Re: Apache tika installation issue
Dear Sudheer,

Did you receive a reply to your question?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: sudheer y <sudhe...@datahubsoftware.in>
Date: Tuesday, September 17, 2013 12:02 AM
To: dev-ow...@tika.apache.org
Subject: Apache tika installation issue

Dear Experts,

Can you give a step-by-step guide to installing Apache Tika in Eclipse using Maven on Windows?

--
Thanks & Best Regards,
Sudheer Kumar Y
Software Engineer
DATAHUB SOFTWARE INDIA PVT LTD. | MAKING IT POSSIBLE
Mobile: +91 8143161684
Email: sudhe...@datahubsoftware.in
WEB: http://www.datahubsoftware.com
Re: Apache Tika for Android
Hi Vasiliy,

It would be great if you could use Apache Tika (which in turn uses Apache PDFBox), provided it runs on Android. Have you tried it?

Cheers,
Chris

-----Original Message-----
From: Василий Саржинский <vasiliy.sarzhins...@mail.ru>
Reply-To: Василий Саржинский <vasiliy.sarzhins...@mail.ru>
Date: Thursday, August 29, 2013 11:21 PM
To: Oleg Tikhonov <olegtikho...@gmail.com>, dev@tika.apache.org, dev-owner <dev-ow...@tika.apache.org>
Subject: Apache Tika for Android

Hello, Oleg!

Yes, you are right. But I have searched the Internet extensively and unfortunately could not find a tool or library that runs on Android and is free for commercial use. There are many such tools and libraries for Android (e.g. iText, Qoppa, pdflib TET), but they are not free of charge. If you know of any free tools on Android for extracting text from PDF, could you please advise me on what to use?

Thanks a lot!

With Best Regards,
Vasiliy Sarzhinskiy
Re: Would become a commiter
Dear Hong-Thai,

Thanks for your interest in the project! Also thanks for your recent contributions. Apache is a meritocracy, and committership/PMC membership is decided upon by the Tika PMC. We again appreciate your interest in the project.

Cheers,
Chris

-----Original Message-----
From: Hong-Thai Nguyen <hong-thai.ngu...@polyspot.com>
Reply-To: dev@tika.apache.org
Date: Wednesday, July 31, 2013 5:43 AM
To: dev@tika.apache.org
Subject: Would become a commiter

Hi all,

I'm currently working at PolySpot, a provider of search engine solutions. Tika is one of the components on our connector side, used to extract many kinds of files. We must frequently upgrade Tika within our product, test it, and where necessary fix Tika parsing bugs. We also have to make temporary releases in our local repository while waiting for the next official Tika release.

Given this synergy, I would like to become a committer on the Tika project.

Regards,

Hong-Thai Nguyen, PhD
R&D Engineer
DDI: +33 (0)1 77 75 73 15
Mob: +33 (0)6 27 04 86 22
Skype: thaichat04
hong-thai.ngu...@polyspot.com
http://www.polyspot.com/
79, rue du Faubourg Poissonnière 75009 Paris - France
Re: [Announce] Welcome Tim Allison as Tika PM member and committer
Welcome, Tim!

-----Original Message-----
From: Nick Burch <n...@apache.org>
Reply-To: dev@tika.apache.org
Date: Tuesday, July 30, 2013 5:29 AM
To: dev@tika.apache.org
Subject: [Announce] Welcome Tim Allison as Tika PM member and committer

Hi All

The Tika PMC VOTE'd to add Tim Allison tallison@ to our merry group as a PMC member and committer.

Welcome, Tim! Please feel free to say a bit about yourself.

Cheers
Nick
Re: Patches for parser.microsoft.WordExtractor
Dear Denis,

Thank you for your contribution to Tika! Filing an issue would be great, head over here:

https://issues.apache.org/jira/browse/TIKA

Please sign up for an account, create an issue and then attach your patch there. I for one would welcome the contribution and am happy to help shepherd it into the sources.

Thank you!

Cheers,
Chris

-----Original Message-----
From: kildishev <kildis...@ispras.ru>
Reply-To: dev@tika.apache.org
Date: Monday, July 1, 2013 5:00 AM
To: dev@tika.apache.org
Cc: Khoroshilov <khoroshi...@ispras.ru>
Subject: Patches for parser.microsoft.WordExtractor

Dear Tika developers,

My name is Denis Kildishev and I work for the Institute for System Programming of the Russian Academy of Sciences (ISPRAS). We use Apache Tika in our open source project Requality (https://forge.ispras.ru/projects/reqdb) for doc-to-XHTML conversion. One of our requirements is an XHTML visual representation that stays close to the original doc. Working with the current version of Tika, we found that some improvements can be made.

I'd like to introduce some modifications that were made to the WordExtractor from the parsers package. They include support for lists, table borders (according to the 2007 specification) and some additional changes to styling and indents. Also, our version of this parser has an XHTML command buffer that helps deal with the problem of nested tables.

If possible, I'd like to contribute those changes back to the Tika project. As the first of the possible patches, I'd like to present the changes to table representation. The information about border color follows the specification of the 2007 format. Spanning of cells is taken from the POI HTML parser.

Some of the patches, including this one, alter the structure of the generated XHTML file. The existing unit tests were changed accordingly. All those changes preserve the original purpose of each test, but achieve it in a different way. One example is the check that a table is present in the output file. In the current trunk version, this is checked by looking for a plain table construct. Once we introduce styling to the table, that exact construct no longer matches, so we look for the table element instead.

I will create a corresponding ticket and attach my patch there.

This is my first contribution to an Apache project, so I would appreciate it if you could guide me on how to proceed.

Yours sincerely,
Denis Kildishev
Software Engineering Department, ISPRAS
Re: [VOTE] Apache TIka 1.4 Release Candidate #1
Hey Uwe,

seems to work on both latest 3.0.3 and 2.2.1 for me.

Cheers,
Chris

-----Original Message-----
From: Uwe Schindler <u...@thetaphi.de>
Reply-To: dev@tika.apache.org
Date: Sunday, June 16, 2013 10:35 AM
To: dev@tika.apache.org
Subject: RE: [VOTE] Apache TIka 1.4 Release Candidate #1

I forgot: What is the official TIKA-supported Maven version?

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent: Sunday, June 16, 2013 7:33 PM
To: dev@tika.apache.org
Subject: RE: [VOTE] Apache TIka 1.4 Release Candidate #1

Hi Mike,

Do you have the SVN coordinates to check out, and which Maven command should I run (for randomized JDK I need a free-style build, so I need the correct Maven command line + goals to test everything)?

How do I pass JDK command line options to Maven's Surefire (like -Xfoobar -XX:+UseG1GC...)? In Lucene it's -Dargs=...; what is the Maven equivalent to drive the Surefire child JVM?

Uwe

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Sunday, June 16, 2013 4:43 PM
To: dev@tika.apache.org
Subject: Re: [VOTE] Apache TIka 1.4 Release Candidate #1

On Sun, Jun 16, 2013 at 10:13 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> I can setup a windows build on the well-known Policeman Jenkins server
> with the famous random JDK versions and many more features, running
> Lucene tests in 24/7 :-)
>
> http://goo.gl/qnxlJ for the talk
> http://jenkins.thetaphi.de/

+1!

Mike McCandless
http://blog.mikemccandless.com
Re: [VOTE] Apache TIka 1.4 Release Candidate #1
Hey Uwe,

Hmmm, I think command line options for Maven work the same way as in Lucene (e.g., -Dargs). Give it a shot and let me know how it works :)

Cheers,
Chris

-----Original Message-----
From: Michael McCandless <luc...@mikemccandless.com>
Reply-To: dev@tika.apache.org
Date: Sunday, June 16, 2013 10:43 AM
To: dev@tika.apache.org
Subject: Re: [VOTE] Apache TIka 1.4 Release Candidate #1

On Sun, Jun 16, 2013 at 1:32 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> Do you have coordinates of SVN to checkout

svn checkout https://svn.apache.org/repos/asf/tika/trunk should work.

> and which Maven Command to run (for randomized JDK I need a free-style
> build, so I need the correct Maven command line + goals to test
> everything).

I think just mvn test?

> How to pass JDK command line options to Maven's Surefire (like -Xfoobar
> -XX:+UseG1GC... - in Lucene it's -Dargs=..., what is the maven equivalent
> to drive the Surefire Child JVM)?

Hmm this is beyond me! Anyone else?

Mike McCandless
http://blog.mikemccandless.com
Re: [VOTE] Apache TIka 1.4 Release Candidate #1
Hey Oleg,

are you sending from an address that's subscribed to the list? I got this email. I also received your +1 before on RC #1 for 1.4.

Cheers,
Chris

-----Original Message-----
From: Oleg Tikhonov <olegtikho...@gmail.com>
Reply-To: dev@tika.apache.org, o...@apache.com
Date: Sunday, June 16, 2013 11:16 AM
To: dev@tika.apache.org
Subject: Re: [VOTE] Apache TIka 1.4 Release Candidate #1

I've tried to send some comments about the release candidate, but got a delivery failure error. Am I off the list?

BR,
Oleg

On Sun, Jun 16, 2013 at 9:07 PM, Chris Mattmann <mattm...@apache.org> wrote:

Ouch, just saw this. Oliver, I'm happy to commit the updated patch to the trunk, but do you absolutely need this in 1.4, requiring me to spin up an RC #3?

Cheers,
Chris

-----Original Message-----
From: Oliver Heger <oliver.he...@oliver-heger.de>
Reply-To: dev@tika.apache.org
Date: Sunday, June 16, 2013 10:25 AM
To: dev@tika.apache.org
Subject: Re: [VOTE] Apache TIka 1.4 Release Candidate #1

On 16.06.2013 05:52, Chris Mattmann wrote:
> Hi Guys,
>
> A candidate for the Tika 1.4 release is available at:
>
> http://people.apache.org/~mattmann/apache-tika-1.4/rc1/
>
> The release candidate is a zip archive of the sources in:
>
> http://svn.apache.org/repos/asf/tika/tags/1.4/
>
> The SHA1 checksum of the archive is
> 1e523c6ed06b4d095d7f6e93a04a8d2ab43c7226.
>
> A staged M2 repository can also be found on repository.apache.org here:
>
> https://repository.apache.org/content/repositories/orgapachetika-020/
>
> Please vote on releasing this package as Apache Tika 1.4.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.4
> [ ] -1 Do not release this package because...
>
> Here is my +1 for the release.
>
> Cheers,
> Chris

There is a minor issue with TIKA-991: The original patch had been applied, but in the meantime I discovered that the code could enter an infinite loop under certain circumstances. Therefore, I provided a second patch (the small attachment from Feb 15th). Could this patch be applied, too, before the release?

Thanks
Oliver
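The release-candidate announcement above publishes a SHA1 checksum for the source archive. A minimal sketch of how a voter might check a downloaded archive against it, using only the JDK, is shown below. The class name `Sha1Check` and the command-line arguments are assumptions made for this illustration; they are not part of the Tika release process described in the thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not part of Tika): compares a SHA-1 digest with a published one.
final class Sha1Check {

    // Returns the SHA-1 digest of the given bytes as lowercase hex.
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // True when the digest of the data equals the published checksum (case-insensitive).
    static boolean matches(byte[] data, String expectedHex) {
        return sha1Hex(data).equalsIgnoreCase(expectedHex.trim());
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the downloaded archive (assumed), args[1]: published checksum.
        byte[] archive = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(matches(archive, args[1]) ? "checksum OK" : "checksum MISMATCH");
    }
}
```

Reading the whole file into memory is fine for a source archive of modest size; for very large files, feeding the digest from a stream would be the more careful choice.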
Re: [VOTE] Apache TIka 1.4 Release Candidate #2
Hey Uwe,

-----Original Message-----
From: Uwe Schindler <u...@thetaphi.de>
Reply-To: dev@tika.apache.org
Date: Sunday, June 16, 2013 2:41 PM
To: dev@tika.apache.org
Subject: RE: [VOTE] Apache TIka 1.4 Release Candidate #2

With Maven 3.0.4 it worked because Maven 3 is able to resolve inter-module dependencies inside the reactor. Maven 2 needs all artifacts to be installed, so a plain mvn test without mvn install before does not work:

http://ahoehma.wordpress.com/2010/12/22/intermodule-dependencies-now-better-working-with-maven-3/

+1. I've found that with MVN 2.2.1 I need to do mvn install first before mvn test IIRC. Maven 3 exhibits the same behavior for me too.

So nothing has really changed with 1.4 compared to other releases. But I found a problem: TIKA does not work with IBM J9. The ForkParserTest fails horribly with class not found and so on:

http://jenkins.thetaphi.de/job/Tika-1.4RC2-Linux/12/consoleFull

Would be great if you could file a ticket for 1.5 with this. Thanks Uwe.

Cheers,
Chris

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent: Sunday, June 16, 2013 11:11 PM
To: dev@tika.apache.org
Subject: RE: [VOTE] Apache TIka 1.4 Release Candidate #2

Hi,

I tried a Random JVM Jenkins build on Linux from the repository (with a clean ~/.mvn folder, so no Maven config at all), but it fails with the given error:

http://jenkins.thetaphi.de/job/Tika-1.4RC2-Linux/5/consoleFull

It tries to download Tika from Maven Central although it's not yet there. Looks like a circular dependency to itself... The same happens in trunk, but there it works because it downloads the previous snapshot from the Apache Snapshots repo.

Uwe

-----Original Message-----
From: Chris Mattmann [mailto:mattm...@apache.org]
Sent: Sunday, June 16, 2013 8:07 PM
To: dev@tika.apache.org
Cc: u...@tika.apache.org
Subject: [VOTE] Apache TIka 1.4 Release Candidate #2

Hi Guys,

A second candidate for the Tika 1.4 release is available at:

http://people.apache.org/~mattmann/apache-tika-1.4/rc2/

The release candidate is a zip archive of the sources in:

http://svn.apache.org/repos/asf/tika/tags/1.4-rc2/

The SHA1 checksum of the archive is 84ce9ebc104ca348a3cd8e95ec31a96169548c13

A staged M2 repository can also be found on repository.apache.org here:

https://repository.apache.org/content/repositories/orgapachetika-022/

Please vote on releasing this package as Apache Tika 1.4. The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.4
[ ] -1 Do not release this package because...

Here is my +1 for the release.

Cheers,
Chris
[ANNOUNCE] Open Source Summit 3.0: Communities Meeting: June 25,26, Washington DC USA
(apologies for cross-post)

http://ossummit.org/

Registration for the Open Source Summit v3.0: Communities Meeting is now open!

This event grows out of the past two years of success. The first Open Source Summit (http://www.nasa.gov/open/source/) was held at NASA's Ames Research Center in Mountain View, CA and focused specifically on NASA's open source policies. Last year, OSS (http://open.nasa.gov/summit/; full schedule: https://hackpad.com/Open-Source-Summit-2012-Live-Schedule-dXG9B8U2KkZ) moved to the University of Maryland in College Park and broadened the discussion to include all agencies, including NASA, the State Dept, and the VA on the planning team.

This year's Open Source Summit will explain how to build, engage with, and maintain open source communities -- and when we say open source, we don't just mean software, we also mean hardware and data. If you are a federal civil servant who needs to build or engage with an open source community, you should plan on attending.

Be warned however: this is not your average event! The multi-agency planning team is tasked with ensuring that the event provides substantive benefit to federal agency personnel, and the format is uniquely designed to deliver not just abstract content from subject matter experts (of course we have those), but also the opportunity to see this knowledge applied to a specific case study, and then to learn how to apply it to your specific situation. In addition, we will collate the results of the discussions during the event and make them available afterwards so that others may learn from the shared experiences and wisdom of their peers.

Also, fyi: registration is unlimited for government employees, but limited to 70 for non-govies. We have a total registration limit of 200.

Please use the #ossdc hashtag if you want to have conversations on Twitter about the meeting.

Thanks and I look forward to your registrations and interest! If you have any questions, please don't hesitate to ask, either by replying to this email, contacting me or anyone else on the Planning Team directly, or by sending us a Tweet!

Thanks!

Cheers,
Chris Mattmann
(on behalf of the Planning Team)
Wanting to contribute to Tika (was Re: [jira] [Commented] (TIKA-992) OpenGraph meta tags to allow multiple values)
Thanks Pankaj. You may want to start a new thread with specific topics that you'd like to discuss. This is a thread related to JIRA and TIKA-992, specific to OpenGraph.

I suggest you:

* hang around on dev@ and see if there are topics that interest you that spring up, and contribute to the discussion there
* review Tika code and suggest improvements, etc., to it, in new threads or in Tika JIRA, referenced below
* review Tika JIRA and existing open bugs/issues and contribute there

HTH!

Cheers,
Chris

-----Original Message-----
From: Pankaj Kumar <pank...@usc.edu>
Reply-To: dev@tika.apache.org
Date: Monday, May 13, 2013 10:04 AM
To: dev@tika.apache.org
Subject: Re: [jira] [Commented] (TIKA-992) OpenGraph meta tags to allow multiple values

Hello All,

I am a new learner of Apache Tika and am very much interested in doing some projects using it. So it would be very kind of you if you could suggest some project ideas.

With Regards,
Pankaj Kumar

On Sun, May 12, 2013 at 12:49 PM, kiran (JIRA) <j...@apache.org> wrote:

[ https://issues.apache.org/jira/browse/TIKA-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13655622#comment-13655622 ]

kiran commented on TIKA-992:

Hi,

Multiple values are not stored for any meta tags in the HTML, and any meta tag can have multi-valued fields too. When we used Tika for parsing with Nutch, we noticed that Tika does not store multiple values for any HTML meta tag. Tika only places one value in the DOM tree, as reported in NUTCH-1467.

Does this patch allow Tika to have multiple values for any meta tag, or just for OpenGraph meta tags?

OpenGraph meta tags to allow multiple values

Key: TIKA-992
URL: https://issues.apache.org/jira/browse/TIKA-992
Project: Tika
Issue Type: Bug
Affects Versions: 1.3
Reporter: Markus Jelsma
Priority: Minor
Fix For: 1.4
Attachments: TIKA-992-1.3-1.patch

HtmlHandler should use Metadata.add() for Open Graph properties instead of the HtmlHandler.addHtmlMetadata() method, which uses Metadata.set(). The og:* properties can be multivalued. The Metadata.set() method overwrites previous entries because it doesn't use Metadata.appendedValues().
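The difference the ticket describes, where set() overwrites earlier entries and add() keeps them, can be illustrated with a small stand-in class. This is a sketch only: `SimpleMetadata` is a hypothetical class written for this example and is not Tika's Metadata implementation; it merely mimics the set/add behavior as the ticket states it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in (not Tika's Metadata class): a key -> multi-value map.
final class SimpleMetadata {
    private final Map<String, List<String>> values = new HashMap<>();

    // Replaces whatever was stored under the name, as the ticket says set() does.
    void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(name, single);
    }

    // Appends to the values already stored under the name.
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Returns a copy of all values for the name, or an empty list if none exist.
    List<String> getValues(String name) {
        List<String> stored = values.get(name);
        return stored == null ? new ArrayList<>() : new ArrayList<>(stored);
    }
}
```

With this stand-in, two calls to set("og:image", ...) leave only the second image URL, while two calls to add("og:image", ...) leave both, which is the behavior the ticket asks for with multi-valued og:* properties.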