Thank you Ken and Nick.
You were right. Instead of passing the bytes, I now pass the URL, and it works.

Avi.

On Fri, Jul 11, 2014 at 6:08 PM, Ken Krugler <kkrugler_li...@transpac.com> wrote:

> On Jul 11, 2014, at 8:01am, Avi Hayun <avrah...@gmail.com> wrote:
>
> > Hi,
> >
> > Scenario:
> > 1. I use tika-core in my app
> > 2. I use the following to detect the stream's media type:
> >
> > byte[] bytes = IOUtils.toByteArray(new URL("http://www.amazon.com/sitemap_video.xml"));
> > String contentType = new Tika().detect(bytes);
> >
> > Obviously, when looking at the sitemap, it is of type application/xml
> >
> > BUT
> >
> > Tika returns content type of: text/plain instead of application/xml !?
> >
> > Upon debugging, I get to the following class:
> > CompositeDetector.detect(InputStream input, Metadata metadata)...
> >
> > Which returns the wrong content type.
> >
> > Does anyone have any idea how to solve it?
>
> The returned content starts with
>
> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:video="http://www.google.com/schemas/sitemap-video/1.0">
>
> Which is why it isn't detected as XML, given the current set of strings
> being used for matching in tika-mimetypes.xml
>
> You could put into the metadata the returned Content-Type header, which
> is text/xml for the above example, and then I think it would work.
>
> But we should also beef up XML detection, e.g. with a pattern like
> <blah xmlns="
>
> -- Ken
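
P.S. For the archive, here is a rough sketch of the kind of check Ken's last suggestion (a pattern like <blah xmlns=") might amount to. This is not how Tika detects XML and it does not touch tika-mimetypes.xml. The class name XmlRootSniffer, the method looksLikeNamespacedXml, and the 1024-byte prefix limit are all made up for illustration, and it uses only java.util.regex from the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of Tika: guesses whether a content prefix is
// XML whose root element declares a namespace (e.g. <urlset xmlns="...">).
final class XmlRootSniffer {

    // Optional whitespace, an optional <?xml ...?> declaration, then a start
    // tag carrying an xmlns or xmlns:prefix attribute. Known gaps: a byte
    // order mark, comments, or a DOCTYPE before the root are not handled.
    private static final Pattern NAMESPACED_ROOT = Pattern.compile(
        "\\s*(?:<\\?xml[^>]*\\?>\\s*)?"
        + "<[A-Za-z_][\\w.:-]*\\s[^>]*?\\bxmlns(?::[\\w.-]+)?\\s*=\\s*[\"']");

    static boolean looksLikeNamespacedXml(String prefix) {
        // lookingAt() anchors the match at the start of the prefix only.
        return NAMESPACED_ROOT.matcher(prefix).lookingAt();
    }

    static boolean looksLikeNamespacedXml(byte[] bytes) {
        // Assumption: inspecting the first 1024 bytes is enough. ISO-8859-1
        // maps every byte to one char, so ASCII markup survives decoding.
        int length = Math.min(bytes.length, 1024);
        return looksLikeNamespacedXml(
            new String(bytes, 0, length, StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        String sitemap =
            "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\" "
            + "xmlns:video=\"http://www.google.com/schemas/sitemap-video/1.0\">";
        System.out.println(looksLikeNamespacedXml(sitemap));       // true
        System.out.println(looksLikeNamespacedXml("hello world")); // false
    }
}
```

A check like this only says "probably XML"; it cannot tell a sitemap from any other namespaced document, so it would complement, not replace, the Content-Type hint Ken mentions.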