[ 
https://issues.apache.org/jira/browse/SOLR-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242376#comment-13242376
 ] 

Chris A. Mattmann commented on SOLR-3295:
-----------------------------------------

Hi Guys:

Couple of comments.

bq. Thanks for doing the test. I know this already because I hit that, too. It's 
caused by TIKA's dependencies. The NetCDF 
(http://www.unidata.ucar.edu/software/netcdf/) parser is only compiled with 
Java 1.6, although TIKA is also only Java 1.5, so this is a TIKA bug.

In Tika, I wouldn't classify this as a bug, since our parser jar dependencies 
can be excluded in various ways. It's simply a requirement for folks that are 
interested in all of the features that the NetCDF library provides, but if you 
don't care about parsing those types of files, you can simply omit that parser 
and exclude the jar file dependency.

bq. It's obscure, indeed, especially for people outside the climate community. 

Obscure? Sorry, not meaning to argue here, but that's pretty patently untrue. 
All data formats are at some level obscure, depending on the community that you 
work in. The "climate" one that you are talking about includes a broad range of 
folks, dealing with remote sensing, climate modeling, decision making, etc., at 
some of the highest levels of government, funding, and other areas, both in the 
U.S. and internationally. NetCDF, HDF, OPeNDAP, and other formats are pretty 
broadly accepted standards. The use of data from NetCDF, for example, resulted 
in 2000+ publications generated as part of the last Intergovernmental Panel on 
Climate Change (IPCC) and its 4th assessment report :) So, not sure it's 
obscure.

bq. The UCAR netcdf library is on the other hand not able to handle streaming 
file input, so TIKA loads the whole file into memory

Yep, it's part of the issue of the underlying data file format more so than the 
actual library itself. It's because it doesn't support random access, and yes, 
the current code I had to bake into Tika unfortunately must work around it by 
loading the whole file into memory. Jukka and I have discussed some better 
support for this, including temporary file support in Tika, and we're working 
on improving it, but we're not there yet.
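
To make the temporary-file idea concrete, here is a minimal sketch using only 
the JDK. It is not Tika's actual implementation; the class name TempFileSpool 
and the methods spool and readAt are hypothetical names made up for this 
illustration. The assumption is that copying the stream to disk once lets a 
reader seek to the part it needs without holding the whole file on the heap.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration only; not part of Tika or Solr.
final class TempFileSpool {

    /** Copies the stream to a temporary file. The caller must delete the file. */
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".tmp");
        try {
            // createTempFile already created the file, so overwrite it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            Files.deleteIfExists(tmp);
            throw e;
        }
        return tmp;
    }

    /** Reads exactly length bytes starting at offset, without loading the rest. */
    static byte[] readAt(Path file, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            byte[] buf = new byte[length];
            raf.readFully(buf);
            return buf;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "header-and-payload".getBytes(StandardCharsets.US_ASCII);
        Path tmp = spool(new ByteArrayInputStream(data));
        try {
            // Seek straight to the first six bytes on disk.
            byte[] head = readAt(tmp, 0, 6);
            System.out.println(new String(head, StandardCharsets.US_ASCII));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The trade-off is disk I/O and temp-file cleanup in exchange for a bounded heap 
footprint, which is why the cleanup sits in a finally block here.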

bq. don't really see the use-case for support in Solr

It's up to you guys. If you want to tell users of Solr, "hey you can drop a 
scientific data file format onto Solr and magically its metadata will be 
indexed", then it might be important. We do this in OODT quite often, and it's 
one of the core use cases (and we even use Lucene and Solr for the metadata 
catalogs :) ).

bq. Loading a 500 Megabyte file into memory just to get the header

A lot of times that header contains the key parameters (spatial and temporal 
bounds) that are required to make a decision as to what to do with the file, as 
well as other met fields including the remote sensing variables, or climate 
variables being measured, valid units, links to publications, etc. So it's more 
than useless information.

bq. Right, but how many people have these gigabyte climate data files

Depends on who is using it. Like I said, this is pretty much all of the files 
that I deal with :), but to each their own. Disabling it in Solr isn't really 
going to affect me (or others) much, since OODT pretty much does this anyway, 
but meh.

> Binaries contain 1.6 classes
> ----------------------------
>
>                 Key: SOLR-3295
>                 URL: https://issues.apache.org/jira/browse/SOLR-3295
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Dawid Weiss
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.6
>
>         Attachments: output.log
>
>
> I ran this tool (does the job): http://code.google.com/p/versioncheck/ on 
> the checkout of branch_3x. To my surprise there is a JAR which contains Java 
> 1.6 code:
> {noformat}
> Major.Minor Version : 50.0             JAVA compatibility : Java 1.6 
> platform: 45.3-50.0
> Number of classes : 60
> Classes are : 
> c:\Work\lucene-solr\.\solr\contrib\extraction\lib\netcdf-4.2-min.jar [:] 
> ucar/unidata/geoloc/Bearing.class
> ...
> {noformat}
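
The check reported in the issue description above can be approximated with the 
JDK alone. The following is a rough sketch, not the versioncheck tool's code; 
the class name ClassVersionCheck and its methods are hypothetical. It relies on 
the class file header layout (4-byte magic 0xCAFEBABE, then 2-byte minor and 
2-byte major version), where major 49 corresponds to Java 5 and major 50 to 
Java 6.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper for illustration only; not the versioncheck tool.
final class ClassVersionCheck {

    /** Class file major version emitted by a Java 5 compiler target. */
    static final int JAVA_5_MAJOR = 49;

    /** Reads the major version from the start of a class file stream. */
    static int majorVersion(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        if (data.readInt() != 0xCAFEBABE) {
            throw new IOException("not a class file (bad magic number)");
        }
        data.readUnsignedShort();        // minor version, ignored here
        return data.readUnsignedShort(); // major version
    }

    /** Maps class entry name to major version for entries above maxMajor. */
    static Map<String, Integer> classesNewerThan(String jarPath, int maxMajor)
            throws IOException {
        Map<String, Integer> offenders = new TreeMap<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                if (!entry.getName().endsWith(".class")) {
                    continue;
                }
                try (InputStream in = jar.getInputStream(entry)) {
                    int major = majorVersion(in);
                    if (major > maxMajor) {
                        offenders.put(entry.getName(), major);
                    }
                }
            }
        }
        return offenders;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ClassVersionCheck.java some.jar [more.jar ...]
        for (String path : args) {
            classesNewerThan(path, JAVA_5_MAJOR).forEach((name, major) ->
                    System.out.println(path + " : " + name + " (major " + major + ")"));
        }
    }
}
```

Pointed at a jar path passed on the command line, this lists every class whose 
major version is above 49, which is the condition the issue reports for 
netcdf-4.2-min.jar.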
