php-general Digest 14 Dec 2012 05:38:55 -0000 Issue 8065

php-general-digest-help Thu, 13 Dec 2012 21:41:14 -0800

Topics (messages 319868 through 319883):


Re: storing & searching docs
        319868 by: Jim Giner
        319871 by: Matijn Woudt
        319872 by: Jim Giner
        319873 by: Matijn Woudt
        319874 by: Bastien
        319875 by: Jim Giner
        319877 by: Matijn Woudt
        319878 by: Ashley Sheridan
        319879 by: Bastien
        319880 by: Jim Giner
        319881 by: Jim Giner
        319882 by: Jim Lucas

Re: Session ?
        319869 by: Marco Behnke
        319870 by: Jim Giner

Re: Lucene library
        319876 by: Larry Garfield

Weird MySQL+Gearman issue
        319883 by: FeIn

Administrivia:

To subscribe to the digest, e-mail:
        php-general-digest-subscr...@lists.php.net

To unsubscribe from the digest, e-mail:
        php-general-digest-unsubscr...@lists.php.net

To post to the list, e-mail:
        php-gene...@lists.php.net


----------------------------------------------------------------------

--- Begin Message ---
Thanks for the input gentlemen.  Two opposing viewpoints!
I understand the concept of using files for the docs and a table to locate them and id them. But I am of the opinion that modern dbs are capable of handling very large objects (of which these docs are NOT!) much easier than years ago, so I am leaning that way still. It will certainly make my search process easier!
More comments anyone?
--- End Message ---

--- Begin Message ---

On Thu, Dec 13, 2012 at 3:10 PM, Jim Giner <jim.gi...@albanyhandball.com>wrote:

> Thanks for the input gentlemen.  Two opposing viewpoints!
>
> I understand the concept of using files for the docs and a table to locate
> them and id them.  But I am of the opinion that modern dbs are capable of
> handling very large objects (of which these docs are NOT!) much easier than
> years ago, so I am leaning that way still.  It will certainly make my
> search process easier!
>
> More comments anyone?
>
>
I'm not sure if there's much difference between large text fields and
blobs, but I had a database (MySQL) with rows that had one blob each of
5-10 MB. At around 200-300 rows the database was pretty slow. After
reaching about 2000 rows, it was terrible. Opening the database with
phpMyAdmin (which executes just a SELECT with LIMIT 1, 30) took around 6
seconds. Doing an ORDER BY on one of the other columns took a few
minutes. I tried both InnoDB and MyISAM for storage, but that didn't make
much of a difference.

So it depends on how large your docs are, I guess.

- Matijn

--- End Message ---

--- Begin Message ---

On 12/13/2012 9:19 AM, Matijn Woudt wrote:

> I'm not sure if there's much difference between large text fields and
> blobs, but I had a database (MySQL) with rows that had one blob each of
> 5-10 MB. At around 200-300 rows the database was pretty slow. After
> reaching about 2000 rows, it was terrible. Opening the database with
> phpMyAdmin (which executes just a SELECT with LIMIT 1, 30) took around 6
> seconds. Doing an ORDER BY on one of the other columns took a few
> minutes. I tried both InnoDB and MyISAM for storage, but that didn't make
> much of a difference.
>
> So it depends on how large your docs are, I guess.
>
> - Matijn

My docs are very small. Two hour meetings, 4 typed pages usually, so approx. 8K of real data each. I don't think storage is much of a concern here. The actual "doc" formats are around 28K and when converted to RTF they grow to 44K - still not very large.


Will this be a concern?

--- End Message ---

--- Begin Message ---

On Thu, Dec 13, 2012 at 3:32 PM, Jim Giner <jim.gi...@albanyhandball.com>wrote:

> My docs are very small.  Two hour meetings, 4 typed pages usually, so
> approx. 8K of real data each.  I don't think storage is much of a concern
> here.  The actual "doc" formats are around 28K and when converted to RTF
> they grow to 44K - still not very large.
>
> Will this be a concern?
>
>
That of course also depends on how many you are planning on storing. I
guess a few hundred will be ok, but after that I'm not so sure..

- Matijn

--- End Message ---

--- Begin Message ---

Bastien Koert

On 2012-12-13, at 9:10 AM, Jim Giner <jim.gi...@albanyhandball.com> wrote:

> Thanks for the input gentlemen.  Two opposing viewpoints!
> 
> I understand the concept of using files for the docs and a table to locate 
> them and id them.  But I am of the opinion that modern dbs are capable of 
> handling very large objects (of which these docs are NOT!) much easier than 
> years ago, so I am leaning that way still.  It will certainly make my search 
> process easier!
> 
> More comments anyone?
> 
> -- 
> PHP General Mailing List (http://www.php.net/)
> To unsubscribe, visit: http://www.php.net/unsub.php
> 

I got away from storing blobs in the db. I noticed significant slowness after
the db grew to about 12 GB in MySQL. Backups are also affected, as they take
longer. This was an older MySQL, but it affected my MSSQL server the same
way.

Nowadays it's files into the file system and data into the db. One thing you
could consider is reading the contents of the file into a db field and
storing just the text to allow full-text search.
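
A minimal sketch of that idea, assuming MySQL and a PDO connection. The
`minutes` table, its columns, and the `search_minutes` function are
hypothetical names, not anything from this thread. Only the extracted text
and the file name go into the table; the document itself stays on disk. In
MySQL releases before 5.6 a FULLTEXT index requires a MyISAM table.

```php
<?php
// Hypothetical schema: one row per document. The file itself stays on disk;
// only its extracted text and its file name go into the table.
const MINUTES_DDL = <<<'SQL'
CREATE TABLE minutes (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    file_name VARCHAR(255) NOT NULL,
    body      MEDIUMTEXT   NOT NULL,
    FULLTEXT KEY ft_body (body)
) ENGINE=MyISAM
SQL;

// Return the file names of the documents whose text matches the search words.
function search_minutes(PDO $db, string $words): array
{
    $stmt = $db->prepare(
        'SELECT file_name FROM minutes
          WHERE MATCH(body) AGAINST (:words IN NATURAL LANGUAGE MODE)'
    );
    $stmt->execute([':words' => $words]);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}
```

One caveat for a small table: with MyISAM, a natural-language search ignores
words that appear in half or more of the rows, so very common words can match
nothing.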

Bastien

--- End Message ---

--- Begin Message ---

On 12/13/2012 10:56 AM, Bastien wrote:



> I got away from storing blobs in the db. I noticed significant slowness after
> the db grew to about 12 GB in MySQL. Backups are also affected, as they take
> longer. This was an older MySQL, but it affected my MSSQL server the same way.
>
> Nowadays it's files into the file system and data into the db. One thing you
> could consider is reading the contents of the file into a db field and
> storing just the text to allow full-text search.
>
> Bastien

A very clever idea! I like it - the best of both worlds. Can you sum up a method for getting the text out of the .doc (or .rtf) files so that I can automate the process for my past and future documents?

Is there a single php function that would accomplish this?

--- End Message ---

--- Begin Message ---

On Thu, Dec 13, 2012 at 5:13 PM, Jim Giner <jim.gi...@albanyhandball.com>wrote:

> A very clever idea!  I like it - the best of both worlds.  Can you sum
> up a method for getting the text out of the .doc (or .rtf) files so that I
> can automate the process for my past and future documents?
> Is there a single php function that would accomplish this?


There's no builtin function for such stuff. doc files are quite tricky to
parse, but rtf files can be parsed pretty easily. One project is PHPRtfLite
[1], which provides you an API for doing this.

- Matijn

[1] http://sourceforge.net/projects/phprtf/

--- End Message ---

--- Begin Message ---

On Thu, 2012-12-13 at 18:41 +0100, Matijn Woudt wrote:

> There's no builtin function for such stuff. doc files are quite tricky to
> parse, but rtf files can be parsed pretty easily. One project is PHPRtfLite
> [1], which provides you an API for doing this.
> 
> - Matijn
> 
> [1] http://sourceforge.net/projects/phprtf/


As well as rtf, the OpenDoc format is easy to read from PHP. Essentially
it's just a bunch of XML files zipped up. Images are kept in the archive
too, which is a handy way to retrieve thumbnails of docs also!
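
A rough sketch of that approach, assuming "OpenDoc" here means an
OpenDocument text file (.odt) and that PHP's zip extension is loaded. The
function name `odt_to_text` is made up for illustration. It pulls
`content.xml` out of the archive and strips the markup, which loses all
formatting but is enough for search indexing.

```php
<?php
// Extract the plain text of an OpenDocument file, or null on failure.
function odt_to_text(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // not a readable zip archive
    }
    $xml = $zip->getFromName('content.xml');
    $zip->close();
    if ($xml === false) {
        return null; // archive has no content.xml
    }
    // Turn the end of each paragraph or heading into a line break,
    // then drop the remaining tags and decode the XML entities.
    $xml = preg_replace('#</text:(p|h)>#', "\n", $xml);
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
    return trim($text);
}
```

Thumbnails can be read the same way with `getFromName()`; many ODF writers
store one at `Thumbnails/thumbnail.png`, though that entry is not guaranteed
to be present.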

Thanks,
Ash
http://www.ashleysheridan.co.uk

--- End Message ---

--- Begin Message ---

On Thu, Dec 13, 2012 at 12:41 PM, Matijn Woudt <tijn...@gmail.com> wrote:
> There's no builtin function for such stuff. doc files are quite tricky to
> parse, but rtf files can be parsed pretty easily. One project is PHPRtfLite
> [1], which provides you an API for doing this.
>
> - Matijn
>
> [1] http://sourceforge.net/projects/phprtf/


There is 
http://stackoverflow.com/questions/188452/reading-writing-a-ms-word-file-in-php
which has some discussion on reading those files with Antiword
(http://www.winfield.demon.nl/)

-- 

Bastien

Cat, the other other white meat

--- End Message ---

--- Begin Message ---

On 12/13/2012 2:40 PM, Bastien Koert wrote:

> There is
> http://stackoverflow.com/questions/188452/reading-writing-a-ms-word-file-in-php
> which has some discussion on reading those files with Antiword
> (http://www.winfield.demon.nl/)

But I can't get antiword. I'm running Windows while my host is running Linux, and there aren't any Linux binaries available for download to put onto my host (assuming that I could do that!). Or am I missing something?

--- End Message ---

--- Begin Message ---

Thanks for all the posts. After reading and googling all afternoon, I think the best approach for me is:

Create two macros in Word (done!) to export each of my .doc files to .txt and .pdf formats.

Create a sql table to hold the .txt contents of my .doc files, along with a reference to the meeting date and the name of the corresponding .pdf file.

Upload my two sets of files with an ftp client and then use a script to load the table with my .txt file data.

Now I just need a couple of scripts: one to allow a user to locate a file and bring up the pdf when he wants to read about a meeting, and a second to accept user input (search words), perform a query against the textual data, and present some kind of results - probably a listing containing a reference to the meeting date and a tbd-length string showing the matching result for each occurrence, i.e., something like n chars in front of and after the match so the user can see the context of the match.

Sizes - a 28k .doc file grows to 142kb in .pdf format and is only 5kb in .txt format. (Actually, if I 'print' the .doc as a pdf instead of using Word's "File, Save as", the resulting pdf is only 70kb. Might need a new macro!)

Thanks again!
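
The "n chars in front of and after the match" part could be sketched roughly
as below. The function name `context_snippets` and its parameters are
hypothetical; it is one way to do it, not a finished search page. It works on
text already fetched from the table and matches case-insensitively.

```php
<?php
// Return one snippet per occurrence of $needle in $text, each with up to
// $n characters of context on either side of the match.
function context_snippets(string $text, string $needle, int $n = 40): array
{
    $snippets = [];
    if ($needle === '') {
        return $snippets; // nothing to search for
    }
    $len = strlen($needle);
    $offset = 0;
    while (($pos = stripos($text, $needle, $offset)) !== false) {
        $start = max(0, $pos - $n);              // clamp at start of text
        $snippets[] = substr($text, $start, ($pos - $start) + $len + $n);
        $offset = $pos + $len;                   // continue after this match
    }
    return $snippets;
}
```

These string functions count bytes, which is fine for plain ASCII minutes;
for multibyte text the `mb_stripos()` and `mb_substr()` equivalents would be
the safer choice.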
--- End Message ---

--- Begin Message ---

On 12/13/2012 02:49 PM, Jim Giner wrote:

> Thanks for all the posts. After reading and googling all afternoon, I
> think the best approach for me is:
>
> Create two macros in Word (done!) to export each of my .doc files to
> .txt and .pdf formats.
>
> Create a sql table to hold the .txt contents of my .doc files, along
> with a reference to the meeting date and the name of the corresponding
> .pdf file.
>
> Upload my two sets of files with an ftp client and then use a script to
> load the table with my .txt file data.
>
> Now I just need a couple of scripts: one to allow a user to locate a
> file and bring up the pdf when he wants to read about a meeting, and a
> second to accept user input (search words), perform a query against
> the textual data, and present some kind of results - probably a listing
> containing a reference to the meeting date and a tbd-length string
> showing the matching result for each occurrence, i.e., something like n
> chars in front of and after the match so the user can see the context of
> the match.
>
> Sizes - a 28k .doc file grows to 142kb in .pdf format and is only 5kb in
> .txt format. (Actually, if I 'print' the .doc as a pdf instead of using
> Word's "File, Save as", the resulting pdf is only 70kb. Might need a
> new macro!)
>
> Thanks again!

I wrote this script a few years ago to extract the plain text out of a .doc file.

http://www.cmsws.com/examples/applications/word2_/convert.php

If you look in the directory you will see a few example files.

You can view them like this:

.../convert.php?filename=test_building.doc

Replace test_building.doc with any of the other .doc files from the dir listing to see its contents.

I currently have it set to 64bit width rows, which shows you some nice pattern stuff in the MS Word format.

I have the source file viewable for the convert.php script as well.

http://www.cmsws.com/examples/applications/word2_/convert.phps

I have thought about extending this even further to figure out the layout and text formatting, but it hasn't gotten much attention for quite some time now.


Hope it helps.

--
Jim Lucas

http://www.cmsws.com/
http://www.cmsws.com/examples/

--- End Message ---

--- Begin Message ---

On 13.12.12 14:49, Jim Giner wrote:
>
>> Ok, that is a different answer from the previous one where you said "it
>> points to a folder within my main domain's structure"
>>
>> Are you running on error_reporting(E_ALL) and ini_set('display_errors',
>> 'On')?
>> Just to be sure that there are no hidden notices or warnings.
>>
>>
> my sub points to a folder within my domain's structure.  My session's
> store point (?) is \tmp.  You asked two different questions.
>
point taken ;)

I will try to do a setup like yours and check which code works for me.

-- 
Marco Behnke
Dipl. Informatiker (FH), SAE Audio Engineer Diploma
Zend Certified Engineer PHP 5.3

Tel.: 0174 / 9722336
e-Mail: ma...@behnke.biz

Softwaretechnik Behnke
Heinrich-Heine-Str. 7D
21218 Seevetal

http://www.behnke.biz

--- End Message ---

--- Begin Message ---

On 12/13/2012 9:16 AM, Marco Behnke wrote:

> On 13.12.12 14:49, Jim Giner wrote:
>>> Ok, that is a different answer from the previous one where you said "it
>>> points to a folder within my main domain's structure"
>>>
>>> Are you running on error_reporting(E_ALL) and ini_set('display_errors',
>>> 'On')?
>>> Just to be sure that there are no hidden notices or warnings.
>>
>> my sub points to a folder within my domain's structure.  My session's
>> store point (?) is \tmp.  You asked two different questions.
>
> point taken ;)
>
> I will try to do a setup like yours and check which code works for me.

Thanks for the interest.  Hope you have better luck than I.

--- End Message ---

--- Begin Message ---

Ah ha.  Did that ever get ported to Zend 2?

--Larry Garfield

On 12/12/12 12:07 AM, Louis Huppenbauer wrote:

> There's Zend_Search_Lucene, part of the Zend framework. I think it should
> be possible to use it without the whole framework though.
>
> http://framework.zend.com/manual/1.12/de/zend.search.lucene.html
>
> 2012/12/12 Larry Garfield <la...@garfieldtech.com>
>
>> Yes, I've worked with Apache Solr quite a bit.  It's a separate server,
>> however, and I'm looking for something with smaller requirements for a
>> concept I want to try. I'd consider SQLite, but I really need something
>> schema-free and PHP-native/easily-installable.
>>
>> --Larry Garfield
>>
>> On 12/11/2012 07:20 PM, israele...@gmail.com wrote:
>>
>>> Check out apache solr.
>>>
>>> The php implementation of Lucene was very slow and had a lot of
>>> performance issues the last time I tried it
>>> ------Original Message------
>>> From: Larry Garfield
>>> To: php-gene...@lists.php.net
>>> Subject: [PHP] Lucene library
>>> Sent: Dec 11, 2012 5:41 PM
>>>
>>> Hi all.
>>>
>>> I recall hearing about there being a PHP port of the Lucene library some
>>> years ago, but I don't recall whence it came.  It was a stand-alone PHP
>>> lib, which needed some integration to be viable as an actual search
>>> engine but worked up to a point by storing data straight on disk as
>>> files.  That meant it didn't scale beyond a few tens of thousands of
>>> records, but that's still a decent number.
>>>
>>> Does that ring a bell for anyone?  Anyone know if it still exists, and
>>> if so where?  I didn't find it in https://packagist.org/ , which is
>>> where I figured it would be if it were still maintained.
>>>
>>> I may have a use for it if it still exists.
>>>
>>> --Larry Garfield
>>>
>>> --
>>> PHP General Mailing List (http://www.php.net/)
>>> To unsubscribe, visit: http://www.php.net/unsub.php

--- End Message ---

--- Begin Message ---

Hi all,

I have a typical web application that does some basic CRUD operations.
Operations that modify the database (inserts, updates, deletes)
trigger a background gearman job to refresh the cache that is used for
another application. The problem is that when I do an update the
gearman worker script (which has its own database connection) does not
pick up the updates. For example if I were to update a row: UPDATE
table SET rowname = 'updated value' WHERE rowid = 1, the respective
row is updated correctly, but when the worker gets around to handle
the job of refreshing the cache for that row the SELECT FROM table
WHERE rowid = 1 picks up the older value (the one before the update).
I activated the mysql log and the queries are definitely run in the
correct order (first the update, then the select on another
connection). Has anybody encountered this issue before? I should also
mention that the worker is set up to run a certain number of cache
refreshing jobs after which it will die, but it will use the same
connection to do those jobs. If I force the worker to use a new
connection for each job everything works fine, otherwise the update is
picked up only for the first job but not for the subsequent jobs. Any
ideas?

Thanks in advance.
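
One possible explanation, offered as an assumption rather than a diagnosis:
if the table is InnoDB and the worker's connection has a transaction open
(for example because autocommit is off), the default REPEATABLE READ
isolation level makes every plain SELECT in that transaction read from the
snapshot taken by its first read. That would match the first job seeing the
update and later jobs on the same connection seeing stale rows. A sketch of
a worker-side helper that ends any open transaction before reading; the
function name and the `cache_source` table are hypothetical placeholders.

```php
<?php
// Hypothetical helper called by the worker once per cache-refresh job.
// Table and column names are placeholders, like those in the post.
function fetch_row_fresh(PDO $db, int $rowId): ?array
{
    // A transaction left open by an earlier job pins the snapshot that
    // consistent reads see; ending it lets the next SELECT see new commits.
    if ($db->inTransaction()) {
        $db->commit();
    }
    $stmt = $db->prepare(
        'SELECT rowid, rowname FROM cache_source WHERE rowid = :id'
    );
    $stmt->execute([':id' => $rowId]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    return $row === false ? null : $row;
}
```

PDO may not know about a transaction that was opened with raw SQL (`BEGIN`
or `SET autocommit=0`). In that case, sending `COMMIT` explicitly before each
job, or running `SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED`
once after connecting, are alternatives worth trying.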

--- End Message ---
