Re: Try datasette for browsing corpa sql reports?

2020-06-17 Thread Tim Allison
I should clarify that I made the Profile db file searchable:
https://corpora.tika.apache.org/base/metadata/tika-eval/tika-eval-1.24.1.mv.db.gz.


I didn't load the mimes csvs, but I can certainly do that as well.
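
For anyone who wants to try loading the mimes CSVs themselves, here is a minimal sketch of putting a CSV into a SQLite table, which is the kind of file Datasette serves. The table name `mimes`, the column names `mime_string` and `file_count`, and the sample rows are all hypothetical; the real CSV layout is not described in this thread.

```python
import csv
import io
import sqlite3

# Hypothetical sample data; the real mimes CSVs may have different columns.
SAMPLE_CSV = "mime_string,file_count\napplication/pdf,3\ntext/plain,5\n"


def quote_identifier(name):
    """Quote a table or column name for use in SQLite DDL."""
    return '"' + name.replace('"', '""') + '"'


def load_csv(conn, table, csv_text):
    """Create `table` from the CSV header, insert every row, return the row count."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(quote_identifier(c) for c in header)
    placeholders = ", ".join("?" for _ in header)
    quoted_table = quote_identifier(table)
    conn.execute(f"CREATE TABLE {quoted_table} ({columns})")
    conn.executemany(f"INSERT INTO {quoted_table} VALUES ({placeholders})", reader)
    conn.commit()
    return conn.execute(f"SELECT COUNT(*) FROM {quoted_table}").fetchone()[0]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(load_csv(connection, "mimes", SAMPLE_CSV), "rows loaded")
```

The columns are created without declared types, so every value is stored as text; a real import would probably want to declare numeric columns.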

On Wed, Jun 17, 2020 at 11:33 AM Tim Allison  wrote:

> All,
>
>   I have Datasette working locally. I converted H2 to SQLite trivially.
>
>   Datasette is pretty slick, especially if we document example SQL queries.
> It works quite easily from Docker and only allows "SELECT" calls...I tried
> to drop/insert/update/modify with (fortunately) no luck.
>
>   Are there any objections to opening a port and launching this on our
> server?  If no objections, any preference for port?
>
>  Cheers,
>
>Tim
>
>
>
> On Wed, Jun 17, 2020 at 9:04 AM Tim Allison  wrote:
>
>> Downloading the entire db and then running it locally with unfamiliar
>> code isn’t easy enough?!
>>
>> But seriously, will look into Datasette. Thank you!
>>
>> Happy to set up Postgres as well.
>>
>> On Wed, Jun 17, 2020 at 8:19 AM Nick Burch  wrote:
>>
>>> Hi All
>>>
>>> As I understand it (which might be wrong!), Tim is generating a bunch of
>>> reports on things in the corpora / how different tools analyse the corpora /
>>> how Tika works on the stuff there, mostly as SQL databases.
>>>
>>> Those databases are then available for anyone who is interested to download
>>> and analyse locally from e.g.
>>> https://corpora.tika.apache.org/base/metadata/mimes/
>>> (though that URL isn't working right now, hopefully fixed soon)
>>>
>>> There's a fairly new project called Datasette, which is a really nice
>>> publishing and exploring interface on top of SQL databases, especially
>>> aimed at archivists, journalists etc -
>>> https://github.com/simonw/datasette
>>>
>>> I wonder (though I won't have time for a few weeks to try myself...) if
>>> it'd be worth stuffing one or two of the SQL reports into a copy of
>>> datasette hosted on the vm, to let people more easily explore the data?
>>>
>>> Cheers
>>> Nick
>>>
>>


Re: Try datasette for browsing corpa sql reports?

2020-06-17 Thread Tim Allison
All,

  I have Datasette working locally. I converted H2 to SQLite trivially.

  Datasette is pretty slick, especially if we document example SQL queries.
It works quite easily from Docker and only allows "SELECT" calls...I tried
to drop/insert/update/modify with (fortunately) no luck.
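
The behaviour described above (reads succeed, writes are rejected) can be reproduced with plain SQLite. This is only a sketch of SQLite's read-only URI mode, not a statement of how Datasette enforces its restriction; the `mimes` table and its columns are hypothetical.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def make_demo_db(path):
    """Create a small database file with one hypothetical table."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE mimes (mime_string TEXT, file_count INTEGER)")
    conn.execute("INSERT INTO mimes VALUES ('application/pdf', 3)")
    conn.commit()
    conn.close()


def open_read_only(path):
    """Open an existing database file with SQLite's mode=ro URI parameter."""
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def try_write(conn):
    """Attempt an INSERT; return the error text if SQLite rejects it, else None."""
    try:
        conn.execute("INSERT INTO mimes VALUES ('text/plain', 1)")
    except sqlite3.OperationalError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.db")
        make_demo_db(db_path)
        connection = open_read_only(db_path)
        print(connection.execute("SELECT * FROM mimes").fetchall())
        print("write rejected:", try_write(connection))
        connection.close()
```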

  Are there any objections to opening a port and launching this on our
server?  If no objections, any preference for port?

 Cheers,

   Tim



On Wed, Jun 17, 2020 at 9:04 AM Tim Allison  wrote:

> Downloading the entire db and then running it locally with unfamiliar code
> isn’t easy enough?!
>
> But seriously, will look into Datasette. Thank you!
>
> Happy to set up Postgres as well.
>
> On Wed, Jun 17, 2020 at 8:19 AM Nick Burch  wrote:
>
>> Hi All
>>
>> As I understand it (which might be wrong!), Tim is generating a bunch of
>> reports on things in the corpora / how different tools analyse the corpora /
>> how Tika works on the stuff there, mostly as SQL databases.
>>
>> Those databases are then available for anyone who is interested to download
>> and analyse locally from e.g.
>> https://corpora.tika.apache.org/base/metadata/mimes/
>> (though that URL isn't working right now, hopefully fixed soon)
>>
>> There's a fairly new project called Datasette, which is a really nice
>> publishing and exploring interface on top of SQL databases, especially
>> aimed at archivists, journalists etc -
>> https://github.com/simonw/datasette
>>
>> I wonder (though I won't have time for a few weeks to try myself...) if
>> it'd be worth stuffing one or two of the SQL reports into a copy of
>> datasette hosted on the vm, to let people more easily explore the data?
>>
>> Cheers
>> Nick
>>
>


[jira] [Commented] (TIKA-3104) Detection of memgraph files exported from Xcode

2020-06-17 Thread Parth (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138526#comment-17138526
 ] 

Parth commented on TIKA-3104:
-

Thanks! I will let you know once I try it out.

> Detection of memgraph files exported from Xcode
> ---
>
> Key: TIKA-3104
> URL: https://issues.apache.org/jira/browse/TIKA-3104
> Project: Tika
>  Issue Type: Wish
>  Components: core
>Affects Versions: 1.24
>Reporter: Parth
>Assignee: Tim Allison
>Priority: Major
>  Labels: detection, features, new-parser
> Fix For: 1.25
>
> Attachments: DeepScroll_Example[4988].memgraph, memgraph.xml, 
> out.memgraph.json, out.memgraph.xhtml
>
>
> I wanted to detect a memgraph file linked by a URL, but detection of 
> memgraph files is not currently supported. I tried adding it to 
> custom-mimetypes, but that did not help.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Try datasette for browsing corpa sql reports?

2020-06-17 Thread Tim Allison
Downloading the entire db and then running it locally with unfamiliar code
isn’t easy enough?!

But seriously, will look into Datasette. Thank you!

Happy to set up Postgres as well.

On Wed, Jun 17, 2020 at 8:19 AM Nick Burch  wrote:

> Hi All
>
> As I understand it (which might be wrong!), Tim is generating a bunch of
> reports on things in the corpora / how different tools analyse the corpora /
> how Tika works on the stuff there, mostly as SQL databases.
>
> Those databases are then available for anyone who is interested to download
> and analyse locally from e.g.
> https://corpora.tika.apache.org/base/metadata/mimes/
> (though that URL isn't working right now, hopefully fixed soon)
>
> There's a fairly new project called Datasette, which is a really nice
> publishing and exploring interface on top of SQL databases, especially
> aimed at archivists, journalists etc -
> https://github.com/simonw/datasette
>
> I wonder (though I won't have time for a few weeks to try myself...) if
> it'd be worth stuffing one or two of the SQL reports into a copy of
> datasette hosted on the vm, to let people more easily explore the data?
>
> Cheers
> Nick
>


Try datasette for browsing corpa sql reports?

2020-06-17 Thread Nick Burch

Hi All

As I understand it (which might be wrong!), Tim is generating a bunch of 
reports on things in the corpora / how different tools analyse the corpora / 
how Tika works on the stuff there, mostly as SQL databases.


Those databases are then available for anyone who is interested to download 
and analyse locally from e.g. 
https://corpora.tika.apache.org/base/metadata/mimes/

(though that URL isn't working right now, hopefully fixed soon)

There's a fairly new project called Datasette, which is a really nice 
publishing and exploring interface on top of SQL databases, especially 
aimed at archivists, journalists etc - 
https://github.com/simonw/datasette


I wonder (though I won't have time for a few weeks to try myself...) if 
it'd be worth stuffing one or two of the SQL reports into a copy of 
datasette hosted on the vm, to let people more easily explore the data?


Cheers
Nick


[jira] [Commented] (TIKA-3097) Out of memory while parsing docx

2020-06-17 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138315#comment-17138315
 ] 

Tim Allison commented on TIKA-3097:
---

Yes, even if the file is read as a stream. IIRC, some formats can only be 
parsed from files because they need random access to the data. For example, if 
the xlsx parser hits sheet1.xml before hitting sharedStrings.xml as it streams 
the zip entries, it’d be out of luck.
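
The entry-ordering problem can be illustrated with a toy zip archive. This is a rough sketch, not Tika or POI code: the archive is built in memory with the sheet written before the shared strings, and the XML fragments are placeholders rather than valid xlsx content.

```python
import io
import zipfile

SHEET = "xl/worksheets/sheet1.xml"
SHARED = "xl/sharedStrings.xml"


def build_workbook_zip():
    """Build an in-memory zip whose sheet entry is stored before sharedStrings."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(SHEET, "<c t='s'><v>0</v></c>")  # cell pointing at string index 0
        zf.writestr(SHARED, "<si><t>hello</t></si>")
    return buf.getvalue()


def entries_in_stored_order(data):
    """Return entry names in the order a forward-only reader would meet them."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()


def held_back_until_shared_strings(names):
    """Entries a forward-only reader must buffer before it sees sharedStrings."""
    held = []
    for name in names:
        if name == SHARED:
            break
        held.append(name)
    return held


def read_shared_strings_first(data):
    """With random access, the shared strings can be read before any sheet."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.read(SHARED).decode("utf-8")


if __name__ == "__main__":
    archive = build_workbook_zip()
    order = entries_in_stored_order(archive)
    print("stored order:", order)
    print("must buffer:", held_back_until_shared_strings(order))
    print("random access:", read_shared_strings_first(archive))
```

A forward-only reader either has to hold the sheet in memory until the shared strings arrive or give up, whereas a file-backed reader can jump straight to the entry it needs.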

Even without needing random access, some parsers may choose to build the 
document components in memory for various reasons before we can extract text.

We try to stream as we can, but some file formats are less than helpful for 
streaming and some of the parsers in our dependencies are not optimized for 
text extraction.

If you find obvious areas for improvements, let us know.

> Out of memory while parsing docx
> 
>
> Key: TIKA-3097
> URL: https://issues.apache.org/jira/browse/TIKA-3097
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.24
>Reporter: suchendra
>Priority: Major
> Attachments: Screenshot from 2020-05-07 08-14-25.png, samplefile.txt, 
> test.docx
>
>
> I have written simple Scala code to extract the content from an uploaded 
> docx file. The JVM goes OOM when Tika tries to parse the file. I configured 
> the JVM heap to 1GB and then tried 2GB, and the same issue occurs both with 
> the jar and in my code.
> I have attached the file for reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3116) .docx can't extract text in nested text content-control

2020-06-17 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138306#comment-17138306
 ] 

Tim Allison commented on TIKA-3116:
---

Would you be willing to share your fix/patch on POI’s bugzilla or was this at 
the Tika level? Thank you, again!

> .docx can't extract text in nested text content-control
> ---
>
> Key: TIKA-3116
> URL: https://issues.apache.org/jira/browse/TIKA-3116
> Project: Tika
>  Issue Type: Bug
>Reporter: lee james
>Priority: Critical
> Attachments: test-document (1).docx
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: renaming master?

2020-06-17 Thread Ray Gauss II
Hi all,

Apologies for not being able to be very involved over the past few years, but 
still trying to follow along and hoping to get time to contribute in the future.

Another option might be ‘stable’?

- Ray

> On Jun 16, 2020, at 1:31 PM, Tim Allison  wrote:
> 
> All,
> 
>  As you may have seen, there's a movement to rename the "master" branch to
> "main" or "trunk" (at least in the U.S.)[1][2].  Github is doing this, and
> I personally think this makes sense.
> 
>  Are there any objections if we change "master"?  If we do change it, is
> there a preference for "main", "trunk" or something else?
> 
>  My personal preference would be for trunk, but I'm open.
> 
> Best,
> 
> Tim
> 
> [1]
> https://www.zdnet.com/article/github-to-replace-master-with-alternative-term-to-avoid-slavery-references/
> [2] https://www.bbc.com/news/technology-53050955