[ https://issues.apache.org/jira/browse/TIKA-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17818069#comment-17818069 ]
Gregory Lepore commented on TIKA-4198:
--------------------------------------

For this set of data from the Bureau of Land Management, it appears that the blobs are just in data and geom, but I wouldn't want to assume that other agencies' data follows the same format. However, if it's true that the agencies are just entering text and it's getting converted to blobs, then I would want them removed from wherever they are in the GPKG file.

> Skip blob fields in geopkg files
> --------------------------------
>
>                 Key: TIKA-4198
>                 URL: https://issues.apache.org/jira/browse/TIKA-4198
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>
> Some geopkg tables store "geom" information in blob fields, starting with
> magic: 47 50 00...
> By default Tika handles blobs as embedded files. This can cause serious
> resource waste on geopkg files that contain hundreds of thousands of rows
> with a geom field.
> We should create a new parser for geopkg that subclasses the sqlite parser
> and skips blobs from the geom fields by default.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
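
As a rough illustration of the blob check discussed in the quoted issue, here is a minimal Java sketch, not Tika's actual implementation. The class name `GeoPackageBlobCheck` and the methods `startsWithGeomMagic` and `shouldSkip` are hypothetical. Only the three leading bytes quoted in the issue (47 50 00) are used, and the rule of skipping by column name "geom" or by magic is an assumption about how such a parser might decide.

```java
import java.util.Arrays;

// Hypothetical helper; not part of Tika. Sketches how a geopkg-aware parser
// might decide that a blob should not be handled as an embedded file.
final class GeoPackageBlobCheck {

    // Leading bytes quoted in the issue ("47 50 00..."); 0x47 0x50 is ASCII "GP".
    // The remainder of the blob header is not modeled here.
    private static final byte[] GEOM_MAGIC = {0x47, 0x50, 0x00};

    private GeoPackageBlobCheck() {
    }

    /** Returns true if the blob begins with the three magic bytes above. */
    static boolean startsWithGeomMagic(byte[] blob) {
        if (blob == null || blob.length < GEOM_MAGIC.length) {
            return false;
        }
        byte[] prefix = Arrays.copyOf(blob, GEOM_MAGIC.length);
        return Arrays.equals(prefix, GEOM_MAGIC);
    }

    /**
     * Assumed policy: skip a blob when its column is named "geom"
     * or when its content starts with the magic bytes.
     */
    static boolean shouldSkip(String columnName, byte[] blob) {
        boolean geomColumn = "geom".equalsIgnoreCase(columnName);
        return geomColumn || startsWithGeomMagic(blob);
    }
}
```

Checking the content as well as the column name is one way to address the concern raised in the comment that other agencies' files may not keep their blobs in the same places. Whether that is the right default is a question for the issue itself.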