php-general Digest 14 Dec 2012 05:38:55 -0000 Issue 8065
Topics (messages 319868 through 319883):
Re: storing & searching docs
319868 by: Jim Giner
319871 by: Matijn Woudt
319872 by: Jim Giner
319873 by: Matijn Woudt
319874 by: Bastien
319875 by: Jim Giner
319877 by: Matijn Woudt
319878 by: Ashley Sheridan
319879 by: Bastien
319880 by: Jim Giner
319881 by: Jim Giner
319882 by: Jim Lucas
Re: Session ?
319869 by: Marco Behnke
319870 by: Jim Giner
Re: Lucene library
319876 by: Larry Garfield
Weird MySQL+Gearman issue
319883 by: FeIn
Administrivia:
To subscribe to the digest, e-mail:
php-general-digest-subscr...@lists.php.net
To unsubscribe from the digest, e-mail:
php-general-digest-unsubscr...@lists.php.net
To post to the list, e-mail:
php-gene...@lists.php.net
----------------------------------------------------------------------
--- Begin Message ---
Thanks for the input gentlemen. Two opposing viewpoints!
I understand the concept of using files for the docs and a table to
locate them and id them. But I am of the opinion that modern dbs are
capable of handling very large objects (of which these docs are NOT!)
much easier than years ago, so I am leaning that way still. It will
certainly make my search process easier!
More comments anyone?
--- End Message ---
--- Begin Message ---
On Thu, Dec 13, 2012 at 3:10 PM, Jim Giner <jim.gi...@albanyhandball.com> wrote:
> Thanks for the input gentlemen. Two opposing viewpoints!
>
> I understand the concept of using files for the docs and a table to locate
> them and id them. But I am of the opinion that modern dbs are capable of
> handling very large objects (of which these docs are NOT!) much easier than
> years ago, so I am leaning that way still. It will certainly make my
> search process easier!
>
> More comments anyone?
>
>
I'm not sure there's much difference between large text fields and
blobs, but I had a database (MySQL) with rows that had one blob each of
5-10 MB. At around 200-300 rows the database was pretty slow. After
reaching about 2000 rows, it was terrible. Opening the database with
phpMyAdmin (which just executes a SELECT with LIMIT 1, 30) took around 6
seconds. Doing an ORDER BY on one of the other columns took a few
minutes. I tried both InnoDB and MyISAM for storage, but that didn't make
much of a difference.
So it depends on how large your docs are, I guess.
- Matijn
--- End Message ---
--- Begin Message ---
On 12/13/2012 9:19 AM, Matijn Woudt wrote:
> On Thu, Dec 13, 2012 at 3:10 PM, Jim Giner <jim.gi...@albanyhandball.com> wrote:
>
> I'm not sure if there's much difference between large text fields and
> blobs, but I had a database (MySQL) with rows that had one blob each of
> 5-10 mb. At around 200-300 rows the database was pretty slow. After
> reaching about 2000 rows, it was terrible. Opening the database with
> phpMyAdmin (which executes just select with LIMIT 1, 30), took around 6
> seconds. Doing a order by on one of the other rows, it took a few
> minutes.. I tried both InnoDB and MyISAM for storage, but that didn't make
> much of a difference.
> So it depends on how large your docs are I guess..
> - Matijn
My docs are very small. Two hour meetings, 4 typed pages usually, so
approx. 8K of real data each. I don't think storage is much of a
concern here. The actual "doc" formats are around 28K and when
converted to RTF they grow to 44K - still not very large.
Will this be a concern?
--- End Message ---
--- Begin Message ---
On Thu, Dec 13, 2012 at 3:32 PM, Jim Giner <jim.gi...@albanyhandball.com>wrote:
> On 12/13/2012 9:19 AM, Matijn Woudt wrote:
>
>> On Thu, Dec 13, 2012 at 3:10 PM, Jim Giner <jim.gi...@albanyhandball.com>
>> wrote:
>>
>>
>>
>> I'm not sure if there's much difference between large text fields and
>> blobs, but I had a database (MySQL) with rows that had one blob each of
>> 5-10 mb. At around 200-300 rows the database was pretty slow. After
>> reaching about 2000 rows, it was terrible. Opening the database with
>> phpMyAdmin (which executes just select with LIMIT 1, 30), took around 6
>> seconds. Doing a order by on one of the other rows, it took a few
>> minutes.. I tried both InnoDB and MyISAM for storage, but that didn't make
>> much of a difference.
>>
>> So it depends on how large your docs are I guess..
>>
>> - Matijn
>>
> My docs are very small. Two hour meetings, 4 typed pages usually, so
> approx. 8K of real data each. I don't think storage is much of a concern
> here. The actual "doc" formats are around 28K and when converted to RTF
> they grow to 44K - still not very large.
>
> Will this be a concern?
>
>
That of course also depends on how many you are planning on storing. I
guess a few hundred will be ok, but after that I'm not so sure..
- Matijn
--- End Message ---
--- Begin Message ---
Bastien Koert
On 2012-12-13, at 9:10 AM, Jim Giner <jim.gi...@albanyhandball.com> wrote:
> Thanks for the input gentlemen. Two opposing viewpoints!
>
> I understand the concept of using files for the docs and a table to locate
> them and id them. But I am of the opinion that modern dbs are capable of
> handling very large objects (of which these docs are NOT!) much easier than
> years ago, so I am leaning that way still. It will certainly make my search
> process easier!
>
> More comments anyone?
>
> --
> PHP General Mailing List (http://www.php.net/)
> To unsubscribe, visit: http://www.php.net/unsub.php
>
I got away from storing blobs in the db. I noticed significant slowness after
the db grew to about 12 GB in MySQL. Backups are also affected, as they take
longer. This was an older MySQL, but it affected my MSSQL server the same
way.
Nowadays it's files into the file system and data into the db. One thing you
could consider is reading the contents of the file into a db field and storing
just the text to allow full-text search.
Bastien
--- End Message ---
--- Begin Message ---
On 12/13/2012 10:56 AM, Bastien wrote:
> Bastien Koert
>
> On 2012-12-13, at 9:10 AM, Jim Giner <jim.gi...@albanyhandball.com> wrote:
>> Thanks for the input gentlemen. Two opposing viewpoints!
>>
>> I understand the concept of using files for the docs and a table to locate them
>> and id them. But I am of the opinion that modern dbs are capable of handling
>> very large objects (of which these docs are NOT!) much easier than years ago,
>> so I am leaning that way still. It will certainly make my search process
>> easier!
>>
>> More comments anyone?
>
> I got away from storing blobs in the db. I noticed significant slowness after
> the db grew to about 12gb in MySQL. Back ups also get affected as they take
> longer. This was older MySQL. But it also affected my mssql server the same way.
>
> Nowadays it's files into the file system and data into the db. One thing you
> could consider is reading the contents of the into a db field and just store
> the text to allow the full text search
>
> Bastien
A very clever idea! I like it - the best of both worlds. Can you sum
up a method for getting the text out of the .doc (or .rtf) files so that
I can automate the process for my past and future documents?
Is there a single php function that would accomplish this?
--- End Message ---
--- Begin Message ---
On Thu, Dec 13, 2012 at 5:13 PM, Jim Giner <jim.gi...@albanyhandball.com>wrote:
> On 12/13/2012 10:56 AM, Bastien wrote:
>
>>
>>
>> Bastien Koert
>>
>> On 2012-12-13, at 9:10 AM, Jim Giner <jim.gi...@albanyhandball.com>
>> wrote:
>>
>>> Thanks for the input gentlemen. Two opposing viewpoints!
>>>
>>> I understand the concept of using files for the docs and a table to
>>> locate them and id them. But I am of the opinion that modern dbs are
>>> capable of handling very large objects (of which these docs are NOT!) much
>>> easier than years ago, so I am leaning that way still. It will certainly
>>> make my search process easier!
>>>
>>> More comments anyone?
>>>
>>>
>>>
>> I got away from storing blobs in the db. I noticed significant slowness
>> after the db grew to about 12gb in MySQL. Back ups also get affected as
>> they take longer. This was older MySQL. But it also affected my mssql
>> server the same way.
>>
>> Nowadays it's files into the file system and data into the db. One thing
>> you could consider is reading the contents of the into a db field and just
>> store the text to allow the full text search
>>
>> Bastien
>>
> A very clever idea! I like it - the best of both worlds. Can you sum
> up a method for getting the text out of the .doc (or .rtf) files so that I
> can automate the process for my past and future documents?
> Is there a single php function that would accomplish this?
There's no builtin function for such stuff. doc files are quite tricky to
parse, but rtf files can be parsed pretty easily. One project is PHPRtfLite
[1], which provides you an API for doing this.
- Matijn
[1] http://sourceforge.net/projects/phprtf/
--- End Message ---
--- Begin Message ---
On Thu, 2012-12-13 at 18:41 +0100, Matijn Woudt wrote:
> On Thu, Dec 13, 2012 at 5:13 PM, Jim Giner
> <jim.gi...@albanyhandball.com>wrote:
>
> > On 12/13/2012 10:56 AM, Bastien wrote:
> >
> >>
> >>
> >> Bastien Koert
> >>
> >> On 2012-12-13, at 9:10 AM, Jim Giner <jim.gi...@albanyhandball.com>
> >> wrote:
> >>
> >>> Thanks for the input gentlemen. Two opposing viewpoints!
> >>>
> >>> I understand the concept of using files for the docs and a table to
> >>> locate them and id them. But I am of the opinion that modern dbs are
> >>> capable of handling very large objects (of which these docs are NOT!) much
> >>> easier than years ago, so I am leaning that way still. It will certainly
> >>> make my search process easier!
> >>>
> >>> More comments anyone?
> >>>
> >>>
> >>>
> >> I got away from storing blobs in the db. I noticed significant slowness
> >> after the db grew to about 12gb in MySQL. Back ups also get affected as
> >> they take longer. This was older MySQL. But it also affected my mssql
> >> server the same way.
> >>
> >> Nowadays it's files into the file system and data into the db. One thing
> >> you could consider is reading the contents of the into a db field and just
> >> store the text to allow the full text search
> >>
> >> Bastien
> >>
> > A very clever idea! I like it - the best of both worlds. Can you sum
> > up a method for getting the text out of the .doc (or .rtf) files so that I
> > can automate the process for my past and future documents?
> > Is there a single php function that would accomplish this?
>
>
> There's no builtin function for such stuff. doc files are quite tricky to
> parse, but rtf files can be parsed pretty easily. One project is PHPRtfLite
> [1], which provides you an API for doing this.
>
> - Matijn
>
> [1] http://sourceforge.net/projects/phprtf/
As well as RTF, the OpenDocument format is easy to read from PHP. Essentially
it's just a bunch of XML files zipped up. Images are kept in the archive
too, which is a handy way to retrieve thumbnails of docs as well!
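A minimal sketch of that idea, assuming the zip extension (ZipArchive) is
installed. The function name odt_to_text is made up for illustration;
content.xml is the standard name of the body file inside an OpenDocument
text archive. All formatting is thrown away and only the text is kept:

```php
<?php
// Hypothetical helper: pull the plain text out of an OpenDocument (.odt) file.
// Assumes the zip extension (ZipArchive) is available.
function odt_to_text(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a zip archive");
    }
    // The document body lives in content.xml inside the archive.
    $xml = $zip->getFromName('content.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException("No content.xml found in $path");
    }
    // Add a newline after each paragraph or heading so that the words of
    // adjacent paragraphs do not run together once the tags are removed.
    $xml = preg_replace('#</text:(p|h)>#', '$0' . "\n", $xml);
    // Drop the markup, then decode XML entities such as &amp;.
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
    return trim($text);
}
```

Calling odt_to_text('minutes.odt') (file name is only an example) would
return the document text as one string, ready to be stored in a text column.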
Thanks,
Ash
http://www.ashleysheridan.co.uk
--- End Message ---
--- Begin Message ---
On Thu, Dec 13, 2012 at 12:41 PM, Matijn Woudt <tijn...@gmail.com> wrote:
> On Thu, Dec 13, 2012 at 5:13 PM, Jim Giner
> <jim.gi...@albanyhandball.com>wrote:
>
>> On 12/13/2012 10:56 AM, Bastien wrote:
>>
>>>
>>>
>>> Bastien Koert
>>>
>>> On 2012-12-13, at 9:10 AM, Jim Giner <jim.gi...@albanyhandball.com>
>>> wrote:
>>>
>>>> Thanks for the input gentlemen. Two opposing viewpoints!
>>>>
>>>> I understand the concept of using files for the docs and a table to
>>>> locate them and id them. But I am of the opinion that modern dbs are
>>>> capable of handling very large objects (of which these docs are NOT!) much
>>>> easier than years ago, so I am leaning that way still. It will certainly
>>>> make my search process easier!
>>>>
>>>> More comments anyone?
>>>>
>>>>
>>>>
>>> I got away from storing blobs in the db. I noticed significant slowness
>>> after the db grew to about 12gb in MySQL. Back ups also get affected as
>>> they take longer. This was older MySQL. But it also affected my mssql
>>> server the same way.
>>>
>>> Nowadays it's files into the file system and data into the db. One thing
>>> you could consider is reading the contents of the into a db field and just
>>> store the text to allow the full text search
>>>
>>> Bastien
>>>
>> A very clever idea! I like it - the best of both worlds. Can you sum
>> up a method for getting the text out of the .doc (or .rtf) files so that I
>> can automate the process for my past and future documents?
>> Is there a single php function that would accomplish this?
>
>
> There's no builtin function for such stuff. doc files are quite tricky to
> parse, but rtf files can be parsed pretty easily. One project is PHPRtfLite
> [1], which provides you an API for doing this.
>
> - Matijn
>
> [1] http://sourceforge.net/projects/phprtf/
There is
http://stackoverflow.com/questions/188452/reading-writing-a-ms-word-file-in-php
which has some discussion on reading those files with Antiword
(http://www.winfield.demon.nl/)
--
Bastien
Cat, the other other white meat
--- End Message ---
--- Begin Message ---
On 12/13/2012 2:40 PM, Bastien Koert wrote:
> On Thu, Dec 13, 2012 at 12:41 PM, Matijn Woudt <tijn...@gmail.com> wrote:
>> On Thu, Dec 13, 2012 at 5:13 PM, Jim Giner <jim.gi...@albanyhandball.com> wrote:
>>> On 12/13/2012 10:56 AM, Bastien wrote:
>>>> Bastien Koert
>>>>
>>>> On 2012-12-13, at 9:10 AM, Jim Giner <jim.gi...@albanyhandball.com>
>>>> wrote:
>>>>> Thanks for the input gentlemen. Two opposing viewpoints!
>>>>>
>>>>> I understand the concept of using files for the docs and a table to
>>>>> locate them and id them. But I am of the opinion that modern dbs are
>>>>> capable of handling very large objects (of which these docs are NOT!) much
>>>>> easier than years ago, so I am leaning that way still. It will certainly
>>>>> make my search process easier!
>>>>>
>>>>> More comments anyone?
>>>>
>>>> I got away from storing blobs in the db. I noticed significant slowness
>>>> after the db grew to about 12gb in MySQL. Back ups also get affected as
>>>> they take longer. This was older MySQL. But it also affected my mssql
>>>> server the same way.
>>>>
>>>> Nowadays it's files into the file system and data into the db. One thing
>>>> you could consider is reading the contents of the into a db field and just
>>>> store the text to allow the full text search
>>>>
>>>> Bastien
>>>
>>> A very clever idea! I like it - the best of both worlds. Can you sum
>>> up a method for getting the text out of the .doc (or .rtf) files so that I
>>> can automate the process for my past and future documents?
>>> Is there a single php function that would accomplish this?
>>
>> There's no builtin function for such stuff. doc files are quite tricky to
>> parse, but rtf files can be parsed pretty easily. One project is PHPRtfLite
>> [1], which provides you an API for doing this.
>>
>> - Matijn
>>
>> [1] http://sourceforge.net/projects/phprtf/
>
> There is
> http://stackoverflow.com/questions/188452/reading-writing-a-ms-word-file-in-php
> which has some discussion on reading those files with Antiword
> (http://www.winfield.demon.nl/)
But I can't get Antiword. I'm running Windows while my host is running
Linux, and there aren't any Linux binaries available for download to
put onto my host (assuming that I could do that!). Or am I missing
something?
--- End Message ---
--- Begin Message ---
Thanks for all the posts. After reading and googling all afternoon, I
think the best approach for me is:
Create two macros in Word (done!) to export each of my .doc files to
.txt and .pdf formats.
Create a sql table to hold the .txt contents of my .doc files, along
with a reference to the meeting date and the name of the corresponding
.pdf file.
Upload my two sets of files with an ftp client and then use a script to
load the table with my .txt file data.
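As a rough sketch of the table and loader script described above (not a
finished implementation): the table name minutes, its column names, and the
YYYY-MM-DD.txt file naming are all assumptions made for the example. It is
written against PDO so that the same code can be tried on an in-memory
SQLite database before pointing it at the host's real database.

```php
<?php
// Sketch only: table and column names are hypothetical.
function create_minutes_table(PDO $db): void
{
    $db->exec(
        'CREATE TABLE minutes (
            meeting_date VARCHAR(10) PRIMARY KEY,
            pdf_name     VARCHAR(255) NOT NULL,
            body         TEXT NOT NULL
        )'
    );
}

// Load every .txt file found in $dir into the table and return how many
// files were loaded. Assumes each file is named after its meeting date
// (YYYY-MM-DD.txt) and that the matching PDF shares the same base name.
function load_minutes(PDO $db, string $dir): int
{
    $insert = $db->prepare(
        'INSERT INTO minutes (meeting_date, pdf_name, body) VALUES (?, ?, ?)'
    );
    $count = 0;
    foreach (glob($dir . '/*.txt') ?: [] as $file) {
        $date = basename($file, '.txt');
        $insert->execute([$date, $date . '.pdf', file_get_contents($file)]);
        $count++;
    }
    return $count;
}
```

To try it out, something like new PDO('sqlite::memory:') followed by
create_minutes_table($db) and load_minutes($db, '/path/to/txt') would do;
the path is a placeholder. On MySQL, a FULLTEXT index on the body column
is one option for the search step.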
Now I just need a couple of scripts to allow a user to locate a file and
bring up the pdf for when he wants to read about a meeting. And a
second script to accept user input (search words) and perform a query
against the textual data and present some kind of results - probably a
listing containing a reference to the meeting date and a tbd-length
string showing the matching result for each occurrence, ie, something
like n chars in front of and after the match so the user can see the
context of the match.
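The "n chars in front of and after the match" part can be sketched as a
small function. The name match_contexts and its parameters are invented for
this example. It counts bytes rather than characters (strlen/substr), which
is an assumption that holds for plain ASCII minutes; multibyte text would
need the mb_* functions instead.

```php
<?php
// Hypothetical helper: return every case-insensitive match of $needle in
// $text, each with up to $n bytes of context before and after it.
function match_contexts(string $text, string $needle, int $n = 30): array
{
    $snippets = [];
    if ($needle === '') {
        return $snippets;
    }
    $offset = 0;
    while (($pos = stripos($text, $needle, $offset)) !== false) {
        // Do not reach back past the start of the text.
        $start = max(0, $pos - $n);
        $length = ($pos - $start) + strlen($needle) + $n;
        $snippets[] = substr($text, $start, $length);
        // Continue searching after this match.
        $offset = $pos + strlen($needle);
    }
    return $snippets;
}
```

The search script would run this over the text of each row that the query
returns and list the snippets under that row's meeting date.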
Sizes - a 28k .doc file grows to 142kb in .pdf format and is only 5kb in
.txt format. (actually, if I 'print' the .doc as a pdf instead of using
the Word's "File,Save as", the resulting pdf is only 70kb. Might need a
new macro!)
Thanks again!
--- End Message ---
--- Begin Message ---
On 12/13/2012 02:49 PM, Jim Giner wrote:
> Thanks for all the posts. After reading and googling all afternoon, I
> think the best approach for me is:
>
> Create two macros in Word (done!) to export each of my .doc files to
> .txt and .pdf formats.
>
> Create a sql table to hold the .txt contents of my .doc files, along
> with a reference to the meeting date and the name of the corresponding
> .pdf file.
>
> Upload my two sets of files with an ftp client and then use a script to
> load the table with my .txt file data.
>
> Now I just need a couple of scripts to allow a user to locate a file and
> bring up the pdf for when he wants to read about a meeting. And a second
> script to accept user input (search words) and perform a query against
> the textual data and present some kind of results - probably a listing
> containing a reference to the meeting date and a tbd-length string
> showing the matching result for each occurrence, ie, something like n
> chars in front of and after the match so the user can see the context of
> the match.
>
> Sizes - a 28k .doc file grows to 142kb in .pdf format and is only 5kb in
> .txt format. (actually, if I 'print' the .doc as a pdf instead of using
> the Word's "File,Save as", the resulting pdf is only 70kb. Might need a
> new macro!)
>
> Thanks again!
I wrote this script a few years ago that extracted the plain text out of
the .doc file.
http://www.cmsws.com/examples/applications/word2_/convert.php
if you look in the directory you will see a few example files.
You can view them like this.
.../convert.php?filename=test_building.doc
replace test_building.doc with any of the other .doc files from the dir
listing to see its contents.
I currently have it set to 64bit width rows. It shows you some nice pattern
stuff with the MS Word format.
I have the source file viewable for the convert.php script as well.
http://www.cmsws.com/examples/applications/word2_/convert.phps
I have thought about extending this even further to figure out the
layout and text formatting. But it hasn't gotten much attention for
quite some time now.
Hope it helps.
--
Jim Lucas
http://www.cmsws.com/
http://www.cmsws.com/examples/
--- End Message ---
--- Begin Message ---
On 13.12.12 14:49, Jim Giner wrote:
>
>> Ok, that is a different answer from the previous one where you said "it
>> points to a folder within my main domain's structure"
>>
>> Are you running on error_reporting(E_ALL) and ini_set('display_errors',
>> 'On')?
>> Just to be sure that there are no hidden notices or warnings.
>>
>>
> my sub points to a folder within my domain's structure. My session's
> store point (?) is \tmp. You asked two different questions.
>
point taken ;)
I will try to do a setup like yours and check which code works for me.
--
Marco Behnke
Dipl. Informatiker (FH), SAE Audio Engineer Diploma
Zend Certified Engineer PHP 5.3
Tel.: 0174 / 9722336
e-Mail: ma...@behnke.biz
Softwaretechnik Behnke
Heinrich-Heine-Str. 7D
21218 Seevetal
http://www.behnke.biz
--- End Message ---
--- Begin Message ---
On 12/13/2012 9:16 AM, Marco Behnke wrote:
> On 13.12.12 14:49, Jim Giner wrote:
>>> Ok, that is a different answer from the previous one where you said "it
>>> points to a folder within my main domain's structure"
>>>
>>> Are you running on error_reporting(E_ALL) and ini_set('display_errors',
>>> 'On')?
>>> Just to be sure that there are no hidden notices or warnings.
>>
>> my sub points to a folder within my domain's structure. My session's
>> store point (?) is \tmp. You asked two different questions.
>
> point taken ;)
> I will try to do a setup like yours and check which code works for me.
Thanks for the interest. Hope you have better luck than I.
--- End Message ---
--- Begin Message ---
Ah ha. Did that ever get ported to Zend 2?
--Larry Garfield
On 12/12/12 12:07 AM, Louis Huppenbauer wrote:
There's Zend_Search_Lucene, part of the Zend framework. I think it should
be possible to use it without the whole framework though.
http://framework.zend.com/manual/1.12/de/zend.search.lucene.html
2012/12/12 Larry Garfield <la...@garfieldtech.com>
Yes, I've worked with Apache Solr quite a bit. It's a separate server,
however, and I'm looking for something with smaller requirements for a
concept I want to try. I'd consider SQLite, but I really need something
schema-free and PHP-native/easily-installable.
--Larry Garfield
On 12/11/2012 07:20 PM, israele...@gmail.com wrote:
Check out apache solr.
The PHP implementation of Lucene was very slow and had a lot of
performance issues the last time I tried it.
------Original Message------
From: Larry Garfield
To: php-gene...@lists.php.net
Subject: [PHP] Lucene library
Sent: Dec 11, 2012 5:41 PM
Hi all.
I recall hearing about there being a PHP port of the Lucene library some
years ago, but I don't recall whence it came. It was a stand-alone PHP
lib, which needed some integration to be viable as an actual search
engine but worked up to a point by storing data straight on disk as
files. That meant it didn't scale beyond a few tens of thousands of
records, but that's still a decent number.
Does that ring a bell for anyone? Anyone know if it still exists, and
if so where? I didn't find it in https://packagist.org/ , which is
where I figured it would be if it were still maintained.
I may have a use for it if it still exists.
--Larry Garfield
--- End Message ---
--- Begin Message ---
Hi all,
I have a typical web application that does some basic CRUD operations.
Operations that modify the database (inserts, updates, deletes)
trigger a background gearman job to refresh the cache that is used for
another application. The problem is that when I do an update the
gearman worker script (which has its own database connection) does not
pick up the updates. For example if I were to update a row: UPDATE
table SET rowname = 'updated value' WHERE rowid = 1, the respective
row is updated correctly, but when the worker gets around to handling
the job of refreshing the cache for that row the SELECT FROM table
WHERE rowid = 1 picks up the older value (the one before the update).
I activated the mysql log and the queries are definitely run in the
correct order (first the update, then the select on another
connection). Has anybody encountered this issue before? I should also
mention that the worker is set up to run a certain number of cache
refreshing jobs after which it will die, but it will use the same
connection to do those jobs. If I force the worker to use a new
connection for each job everything works fine, otherwise the update is
picked up only for the first job but not for the subsequent jobs. Any
ideas?
Thanks in advance.
--- End Message ---