I have a Pig script--currently running in local mode--that processes a
huge file containing a list of categories:

    /root/level1/level2/level3
    /root/level1/level2/level3/level4
    ...

I need to insert each of these into an existing database by calling a
stored procedure. Because I'm new to Pig and the UDF interface is a
little daunting, I'm trying to get something done by streaming the
file's content through a PHP script.

I'm finding that the PHP script only processes half of the category
lines I'm passing through it, though. More precisely, I see a record
returned for ceil( pig_categories/2 ). A limit of 15 will produce 8
entries after streaming through the PHP script--the last one will be
empty. The example output below indicates that only the even-numbered
records are getting processed.

Here's a relevant snippet from my Pig script:

    all_categories = LOAD 'categories.txt' USING PigStorage() AS (category:chararray);
    ...Several layers of filtering...
    ordered  = ORDER mappable_categories BY category;
    limited  = LIMIT ordered 10;

    categories = FOREACH limited GENERATE category;
    DUMP categories; -- Displays all 10 categories

    streamed = STREAM limited THROUGH `php -nF categorize.php`;
    DUMP streamed; -- Displays only 5 categories

And the PHP script receiving the stream:

    $category = fgets( STDIN );
    echo $category;
    # Yep, that's all there is right now

Output:

    -- From the `DUMP categories` line
    (Arts)
    (Arts/Animation)
    (Arts/Animation/Anime)
    (Arts/Animation/Anime/Characters)
    (Arts/Animation/Anime/Clubs_and_Organizations)
    (Arts/Animation/Anime/Collectibles)
    (Arts/Animation/Anime/Collectibles/Cels)
    (Arts/Animation/Anime/Collectibles/Models_and_Figures)
    (Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures)
    (Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Gundam)

    -- From the `DUMP streamed` line
    (Arts/Animation)
    (Arts/Animation/Anime/Characters)
    (Arts/Animation/Anime/Collectibles)
    (Arts/Animation/Anime/Collectibles/Models_and_Figures)
    (Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Gundam)

As you can see, it looks like only the even lines are being handled by
the PHP script.
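
To make the pattern concrete, here is a minimal sketch of one possible
explanation. It assumes that `php -F` itself consumes one line of standard
input for each run of the script, and that the script's `fgets( STDIN )`
then reads the following line. The Python below only simulates that assumed
behavior; it is neither PHP nor Pig, and the function name
`simulate_php_dash_f` is hypothetical.

```python
import io


def simulate_php_dash_f(lines):
    """Simulate the ASSUMED `php -F` loop: the runtime takes one input line
    per run of the script, then the script body reads the next line itself
    and echoes it."""
    stdin = io.StringIO("".join(line + "\n" for line in lines))
    echoed = []
    while True:
        per_run_line = stdin.readline()  # line consumed by the assumed -F loop
        if per_run_line == "":           # end of input: no further runs
            break
        category = stdin.readline()      # stands in for fgets( STDIN )
        echoed.append(category.rstrip("\n"))  # stands in for echo $category
    return echoed
```

Under these assumptions, 10 input lines yield the 5 even-numbered ones, and
15 input lines yield 8 entries with an empty final entry, which matches the
counts described above.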

I haven't found any information about streaming through a PHP file, or
in fact much about streaming through any file at all. This is
particularly true for information about the content of the stream
receiver file. I'm really hoping someone here can help me out because
I'm kind of out of places to ask this question. Any guidance or
insight would be much appreciated. It's kind of important that I
process 100% of the records rather than half of them. :-)

I posted a question about this on StackOverflow yesterday
(http://stackoverflow.com/questions/3815673/pigs-stream-through-php),
but it doesn't look like there's much Pig visibility on SO at this
point. I'll update that question with any answer I get from this list.

Thanks for your help.

-- 
+rw
 