[
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14318232#comment-14318232
]
Tim Allison commented on TIKA-1511:
-----------------------------------
Bottom line: it will be simpler to treat the full db with all tables as one big
file. We can still treat CLOBs and BLOBs as embedded documents.
Details:
When I tried to cut out the {{JDBCInputStream}} and just send in a zero-byte
{{InputStream}}, regular parsing worked properly.
However, if a user tries to use a {{ParserContainerExtractor}}, that fails to
reach the BLOBs because of this:
{code}
MediaType type = detector.detect(tis, metadata);

if (extractor == null) {
    // Let the handler process the embedded resource
    handler.handle(filename, type, tis);
} else {
    // Use a temporary file to process the stream twice
    File file = tis.getFile();

    // Let the handler process the embedded resource
    InputStream input = TikaInputStream.get(file);
    try {
        handler.handle(filename, type, input);
    } finally {
        input.close();
    }

    // Recurse
    extractor.extract(tis, extractor, handler);
}
{code}
When the extractor is called below the {{// Recurse}} comment, it only sees the
zero-byte {{TikaInputStream}}. It does not see the {{type}} or the
{{metadata}}. So, in the case of {{AutoDetectParser}}, it only sees a
zero-byte {{InputStream}} and therefore detects it as {{application/octet-stream}}.
In short, there is currently no way to pass the detected type through to the
extractor. We could, of course, add a parameter for {{type}} or {{metadata}}
to {{ParserContainerExtractor}}'s {{extract}} signature...
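To make that last idea concrete, here is a rough, self-contained sketch in plain
Java. It does not use any Tika classes: {{TypeHintSketch}}, {{detect}} and both
{{extract}} overloads are made-up names for illustration only, and the toy
detector just checks for the SQLite file header. It is meant to show the shape
of the problem (re-detecting from an empty stream falls back to
{{application/octet-stream}}) and how an extra type parameter would avoid it,
not to propose an actual signature.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in; none of these names are Tika API.
class TypeHintSketch {

    static final String OCTET_STREAM = "application/octet-stream";
    static final String SQLITE = "application/x-sqlite3";

    // Toy magic-based detector: only looks at the leading bytes of the stream.
    static String detect(InputStream in) {
        try {
            in.mark(16);
            byte[] head = new byte[16];
            int n = in.read(head);
            in.reset();
            if (n >= 15 && "SQLite format 3".equals(
                    new String(head, 0, 15, StandardCharsets.US_ASCII))) {
                return SQLITE;
            }
            // Empty or unrecognized content falls back to the generic type.
            return OCTET_STREAM;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Current shape: the extractor has nothing but the bytes to go on.
    static String extract(InputStream in) {
        return detect(in);
    }

    // Assumed alternative: the caller hands over the type it already detected.
    static String extract(InputStream in, String typeHint) {
        return typeHint != null ? typeHint : detect(in);
    }

    public static void main(String[] args) {
        InputStream empty = new ByteArrayInputStream(new byte[0]);
        System.out.println(extract(empty));         // application/octet-stream
        System.out.println(extract(empty, SQLITE)); // application/x-sqlite3
    }
}
```

With the hint, the zero-byte stream no longer matters for type selection; the
same could presumably be done by passing {{metadata}} instead of a bare type.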
> Create a parser for SQLite3
> ---------------------------
>
> Key: TIKA-1511
> URL: https://issues.apache.org/jira/browse/TIKA-1511
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.6
> Reporter: Luis Filipe Nassif
> Fix For: 1.8
>
> Attachments: TIKA-1511v1.patch, TIKA-1511v2.patch, TIKA-1511v3.patch,
> testSQLLite3b.db, testSQLLite3b.db
>
>
> I think it would be very useful, as sqlite is used as data storage by a wide
> range of applications. Opening the ticket to track it.