[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837378#comment-17837378 ]

ASF GitHub Bot commented on TIKA-4181:
--

nddipiazza commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1566194105

## tika-pipes/tika-grpc/src/main/proto/tika.proto: ##
@@ -0,0 +1,92 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+package tika;
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+
+service Tika {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}
+  rpc UpdateFetcher(UpdateFetcherRequest) returns (UpdateFetcherReply) {}
+  rpc GetFetcher(GetFetcherRequest) returns (GetFetcherReply) {}
+  rpc ListFetchers(ListFetchersRequest) returns (ListFetchersReply) {}
+  rpc DeleteFetcher(DeleteFetcherRequest) returns (DeleteFetcherReply) {}
+  rpc FetchAndParse(FetchAndParseRequest) returns (FetchAndParseReply) {}
+  rpc FetchAndParseServerSideStreaming(FetchAndParseRequest)
+    returns (stream FetchAndParseReply) {}
+  rpc FetchAndParseBiDirectionalStreaming(stream FetchAndParseRequest)
+    returns (stream FetchAndParseReply) {}
+}
+
+message CreateFetcherRequest {
+  string name = 1;
+  string fetcher_class = 2;

Review Comment:
A string is needed so people can add fetchers dynamically. Validation will make sure the class exists and will return a clear error message.

> Grpc + Tika Pipes - pipe iterator and emitter
> -
>
> Key: TIKA-4181
> URL: https://issues.apache.org/jira/browse/TIKA-4181
> Project: Tika
> Issue Type: New Feature
> Components: tika-pipes
> Reporter: Nicholas DiPiazza
> Priority: Major
> Attachments: image-2024-02-06-07-54-50-116.png
>
>
> Add full tika-pipes support of grpc
> * pipe iterator
> * fetcher
> * emitter
> Requires we create a service contract that specifies the inputs we require from each method.
> Then we will need to implement the different components with a grpc client generated using the contract.
> This would enable developers to run tika-pipes as a persistently running daemon instead of just a single batch app, because it can continue to stream out more inputs.
> !image-2024-02-06-07-54-50-116.png!

--
This message was sent by Atlassian Jira (v8.20.10#820010)
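The class-name validation described in the review comment above (keep `fetcher_class` a free-form string, but check that the class exists and report a readable error) could look roughly like the following. This is a minimal sketch, not the code in the PR: `FetcherClassValidator` and its messages are hypothetical, and `java.lang.Runnable` stands in for whatever fetcher interface the server would actually require.

```java
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Tika's API. */
final class FetcherClassValidator {
    private FetcherClassValidator() {}

    /**
     * Returns an empty Optional when className names a loadable class that
     * implements requiredType; otherwise returns a human-readable error message.
     */
    static Optional<String> validate(String className, Class<?> requiredType) {
        if (className == null || className.isBlank()) {
            return Optional.of("fetcher_class must not be empty");
        }
        try {
            // initialize=false: check that the class exists without running its static initializers.
            Class<?> candidate = Class.forName(
                    className, false, FetcherClassValidator.class.getClassLoader());
            if (!requiredType.isAssignableFrom(candidate)) {
                return Optional.of("Class " + className
                        + " does not implement " + requiredType.getName());
            }
            return Optional.empty();
        } catch (ClassNotFoundException | LinkageError e) {
            return Optional.of("Could not find fetcher class " + className
                    + " on the classpath");
        }
    }

    public static void main(String[] args) {
        // Runnable is only a stand-in for the real fetcher interface (an assumption).
        System.out.println(validate("java.lang.Thread", Runnable.class));
        System.out.println(validate("com.example.NoSuchFetcher", Runnable.class));
    }
}
```

A server handler could map a non-empty result to an error reply for `CreateFetcher`, which keeps custom fetchers possible while still rejecting names that cannot be loaded.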
[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837377#comment-17837377 ]

ASF GitHub Bot commented on TIKA-4181:
--

nddipiazza commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1566193576

## tika-pipes/tika-grpc/src/main/proto/tika.proto: ##
@@ -0,0 +1,92 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+package tika;
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+
+service Tika {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}
+  rpc UpdateFetcher(UpdateFetcherRequest) returns (UpdateFetcherReply) {}
+  rpc GetFetcher(GetFetcherRequest) returns (GetFetcherReply) {}

Review Comment:
added not-found detection
[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832819#comment-17832819 ]

ASF GitHub Bot commented on TIKA-4181:
--

bartek commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1546235205

## tika-pipes/tika-grpc/src/main/proto/tika.proto: ##
@@ -0,0 +1,92 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+package tika;
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+
+service Tika {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}
+  rpc UpdateFetcher(UpdateFetcherRequest) returns (UpdateFetcherReply) {}
+  rpc GetFetcher(GetFetcherRequest) returns (GetFetcherReply) {}
+  rpc ListFetchers(ListFetchersRequest) returns (ListFetchersReply) {}
+  rpc DeleteFetcher(DeleteFetcherRequest) returns (DeleteFetcherReply) {}
+  rpc FetchAndParse(FetchAndParseRequest) returns (FetchAndParseReply) {}
+  rpc FetchAndParseServerSideStreaming(FetchAndParseRequest)
+    returns (stream FetchAndParseReply) {}
+  rpc FetchAndParseBiDirectionalStreaming(stream FetchAndParseRequest)
+    returns (stream FetchAndParseReply) {}
+}
+
+message CreateFetcherRequest {
+  string name = 1;
+  string fetcher_class = 2;

Review Comment:
Should this be a protobuf enum containing the constrained set of classes? Or does Tika need to support arbitrary strings here in case of custom fetchers not included in the Tika project?
[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832817#comment-17832817 ]

ASF GitHub Bot commented on TIKA-4181:
--

bartek commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1546232766

## tika-pipes/tika-grpc/src/main/proto/tika.proto: ##
@@ -0,0 +1,92 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+package tika;
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+
+service Tika {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}

Review Comment:
Could we document these RPCs so that their high-level behaviour is clear? For example, if I try to create a fetcher which already exists, what is the expected reply? Is it an error response on the RPC, or will CreateFetcherReply carry information identifying the error?

If CreateFetcher were made idempotent, could we collapse these into a single RPC (UpdateFetcher) which either creates, updates, or no-ops (no changes despite the call) the Fetcher? I don't want to overcomplicate the Tika side, of course, but I am curious whether we can improve the client interface.

## tika-pipes/tika-grpc/src/main/proto/tika.proto: ##
@@ -0,0 +1,92 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+package tika;
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+
+service Tika {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}
+  rpc UpdateFetcher(UpdateFetcherRequest) returns (UpdateFetcherReply) {}
+  rpc GetFetcher(GetFetcherRequest) returns (GetFetcherReply) {}

Review Comment:
Similar to the above regarding documentation, it would be great to understand what happens if I try to get a fetcher which does not exist. Is there a distinct error, or do I simply get an empty GetFetcherReply?
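The two questions raised above (what a duplicate create should return, and what a lookup of a missing fetcher should return) can be made concrete with a small in-memory model. The sketch below is only an illustration of the idempotent "create, update, or no-op" idea; `FetcherRegistrySketch`, `Outcome`, and the class names passed in are hypothetical and do not reflect how the PR stores fetchers. In a gRPC service, the empty result from `get` would conventionally be reported with the `NOT_FOUND` status code rather than an empty reply.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory model of an idempotent fetcher registry. */
final class FetcherRegistrySketch {
    /** What a single upsert call did, so the client can tell the cases apart. */
    enum Outcome { CREATED, UPDATED, UNCHANGED }

    // Maps fetcher name to fetcher class name; a real registry would also hold configuration.
    private final Map<String, String> classByName = new ConcurrentHashMap<>();

    /** Creates the fetcher if absent, updates it if the class differs, otherwise does nothing. */
    Outcome upsert(String name, String fetcherClass) {
        Objects.requireNonNull(name, "name");
        Objects.requireNonNull(fetcherClass, "fetcherClass");
        String previous = classByName.put(name, fetcherClass);
        if (previous == null) {
            return Outcome.CREATED;
        }
        return previous.equals(fetcherClass) ? Outcome.UNCHANGED : Outcome.UPDATED;
    }

    /** Returns the registered class name, or empty when no fetcher has that name. */
    Optional<String> get(String name) {
        Objects.requireNonNull(name, "name");
        return Optional.ofNullable(classByName.get(name));
    }

    public static void main(String[] args) {
        FetcherRegistrySketch registry = new FetcherRegistrySketch();
        System.out.println(registry.upsert("fs", "com.example.FirstFetcher"));
        System.out.println(registry.upsert("fs", "com.example.FirstFetcher"));
        System.out.println(registry.upsert("fs", "com.example.SecondFetcher"));
        System.out.println(registry.get("missing").isPresent());
    }
}
```

Returning the outcome explicitly is one way to keep a single RPC while still letting clients distinguish a fresh create from an update or a no-op.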
[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832676#comment-17832676 ]

ASF GitHub Bot commented on TIKA-4181:
--

nddipiazza commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1545858447

## tika-pipes/tika-grpc/src/main/proto/tika.proto: ##
@@ -0,0 +1,90 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+package tika;
+
+service Tika {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}
+  rpc UpdateFetcher(UpdateFetcherRequest) returns (UpdateFetcherReply) {}
+  rpc GetFetcher(GetFetcherRequest) returns (GetFetcherReply) {}
+  rpc ListFetchers(ListFetchersRequest) returns (ListFetchersReply) {}
+  rpc DeleteFetcher(DeleteFetcherRequest) returns (DeleteFetcherReply) {}
+  rpc FetchAndParse(FetchAndParseRequest) returns (FetchAndParseReply) {}
+  rpc FetchAndParseServerSideStreaming(FetchAndParseRequest) returns (stream FetchAndParseReply) {}
+  rpc FetchAndParseBiDirectionalStreaming(stream FetchAndParseRequest) returns (stream FetchAndParseReply) {}
+}
+
+message CreateFetcherRequest {
+  string name = 1;

Review Comment:
yes we are using "name" as the ID. @tballison any thoughts here? maybe we should rename that for 3.x
[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832675#comment-17832675 ]

ASF GitHub Bot commented on TIKA-4181:
--

nddipiazza commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1545858324

## tika-pipes/tika-grpc/src/main/proto/tika.proto: ##

Review Comment:
I ran the normal proto linter. I'm going to leave the other stuff as is: the extra buf checks didn't seem to add much value for my context, and they added hours.
[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832568#comment-17832568 ]

ASF GitHub Bot commented on TIKA-4181:
--

bartek commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1545596130

## tika-pipes/tika-grpc/src/main/proto/tika.proto: ##
@@ -0,0 +1,90 @@
+// Copyright 2015 The gRPC Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+syntax = "proto3";
+
+option java_multiple_files = true;
+option java_package = "org.apache.tika";
+option java_outer_classname = "TikaProto";
+option objc_class_prefix = "HLW";
+
+package tika;
+
+service Tika {
+  rpc CreateFetcher(CreateFetcherRequest) returns (CreateFetcherReply) {}
+  rpc UpdateFetcher(UpdateFetcherRequest) returns (UpdateFetcherReply) {}
+  rpc GetFetcher(GetFetcherRequest) returns (GetFetcherReply) {}
+  rpc ListFetchers(ListFetchersRequest) returns (ListFetchersReply) {}
+  rpc DeleteFetcher(DeleteFetcherRequest) returns (DeleteFetcherReply) {}
+  rpc FetchAndParse(FetchAndParseRequest) returns (FetchAndParseReply) {}
+  rpc FetchAndParseServerSideStreaming(FetchAndParseRequest) returns (stream FetchAndParseReply) {}
+  rpc FetchAndParseBiDirectionalStreaming(stream FetchAndParseRequest) returns (stream FetchAndParseReply) {}
+}
+
+message CreateFetcherRequest {
+  string name = 1;

Review Comment:
Must `name` be unique across all initialized fetchers? To me, `name` implies a descriptive label; is this more of an ID?

The use case I am thinking of is creating multiple fetchers with the same class. Right now I would create a unique name for each one. Is that the correct expectation?
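The use case in the comment above, several fetchers of the same class told apart only by `name`, can be sketched as follows. This is an assumption-laden illustration of "name acts as the ID", not the PR's implementation; `UniqueFetcherNames` and the fetcher class string are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: the fetcher name is the unique key, the class may repeat. */
final class UniqueFetcherNames {
    private final Map<String, String> classByName = new ConcurrentHashMap<>();

    /** Registers the fetcher and returns true, or returns false if the name is already taken. */
    boolean create(String name, String fetcherClass) {
        // putIfAbsent returns null only when no mapping existed for this name.
        return classByName.putIfAbsent(name, fetcherClass) == null;
    }

    int size() {
        return classByName.size();
    }

    public static void main(String[] args) {
        UniqueFetcherNames fetchers = new UniqueFetcherNames();
        String sameClass = "com.example.HypotheticalFetcher";
        System.out.println(fetchers.create("invoices", sameClass));
        System.out.println(fetchers.create("archive", sameClass));
        System.out.println(fetchers.create("invoices", sameClass));
    }
}
```

Under this reading, two fetchers may share a class as long as their names differ, and a second create with an existing name is rejected rather than silently replacing the first.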
[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832336#comment-17832336 ]

ASF GitHub Bot commented on TIKA-4181:
--

bartek commented on code in PR #1702:
URL: https://github.com/apache/tika/pull/1702#discussion_r1544981545

## tika-pipes/tika-grpc/src/main/proto/tika.proto: ##

Review Comment:
For your consideration @nddipiazza, I ran `buf lint` on this protobuf (as I am syncing it to a local repository for development purposes) and here's the report:

```
services/tika/pbtika/tika.proto:29:9:Service name "Tika" should be suffixed with "Service".
services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:36:40:RPC request type "FetchAndParseRequest" should be named "FetchAndParseServerSideStreamingRequest" or "TikaFetchAndParseServerSideStreamingRequest".
services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:37:50:RPC request type "FetchAndParseRequest" should be named "FetchAndParseBiDirectionalStreamingRequest" or "TikaFetchAndParseBiDirectionalStreamingRequest".
services/tika/pbtika/tika.proto:42:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
services/tika/pbtika/tika.proto:52:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
services/tika/pbtika/tika.proto:61:10:Field name "fetcherName" should be lower_snake_case, such as "fetcher_name".
services/tika/pbtika/tika.proto:62:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
services/tika/pbtika/tika.proto:67:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
services/tika/pbtika/tika.proto:85:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
services/tika/pbtika/tika.proto:90:9:Field name "pageNumber" should be lower_snake_case, such as "page_number".
services/tika/pbtika/tika.proto:91:9:Field name "numFetchersPerPage" should be lower_snake_case, such as "num_fetchers_per_page".
services/tika/pbtika/tika.proto:95:28:Field name "getFetcherReply" should be lower_snake_case, such as "get_fetcher_reply".
Generating protobufs for ./proto/pbingest
services/tika/pbtika/tika.proto:29:9:Service name "Tika" should be suffixed with "Service".
services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:35:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:36:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:36:40:RPC request type "FetchAndParseRequest" should be named "FetchAndParseServerSideStreamingRequest" or "TikaFetchAndParseServerSideStreamingRequest".
services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseReply" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:37:3:"tika.FetchAndParseRequest" is used as the request or response type for multiple RPCs.
services/tika/pbtika/tika.proto:37:50:RPC request type "FetchAndParseRequest" should be named "FetchAndParseBiDirectionalStreamingRequest" or "TikaFetchAndParseBiDirectionalStreamingRequest".
services/tika/pbtika/tika.proto:42:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
services/tika/pbtika/tika.proto:52:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
services/tika/pbtika/tika.proto:61:10:Field name "fetcherName" should be lower_snake_case, such as "fetcher_name".
services/tika/pbtika/tika.proto:62:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
services/tika/pbtika/tika.proto:67:10:Field name "fetchKey" should be lower_snake_case, such as "fetch_key".
services/tika/pbtika/tika.proto:85:10:Field name "fetcherClass" should be lower_snake_case, such as "fetcher_class".
services/tika/pbtika/tika.proto:90:9:Field name
```
[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814829#comment-17814829 ]

Tim Allison commented on TIKA-4181:
---

We recently had a request for something like what's in the diagram for tika-server: submit a fetch request, run the parse in a forked process (tika-pipes), but then instead of the emitter shipping off the results, the results are returned to the caller. I _think_ this is what you describe in the diagram.

There is room in the returned PipesResult for the full emitData. We need to modify the pipesClient to skip the usual emitting and return the full results of the parse for that request -- I think we can do that now by setting maxForEmitBatchBytes to a value < 0, but we should have a more elegant way of doing this.

WDYT?
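One possible shape for the "more elegant way" asked for above is an explicit option on the request instead of a sentinel value in a size setting. The sketch below only illustrates that design choice; `ResultRoutingSketch`, `ResultHandling`, and `RecordingEmitter` are hypothetical names and do not correspond to existing Tika classes or to how `PipesResult` is actually populated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch: choose between emitting and returning parse results explicitly. */
final class ResultRoutingSketch {
    /** Explicit per-request choice, replacing a "negative byte limit" convention. */
    enum ResultHandling { EMIT, RETURN_TO_CALLER }

    /** Stand-in for an emitter; it only records what would have been shipped. */
    static final class RecordingEmitter {
        final List<String> emitted = new ArrayList<>();

        void emit(String parsedContent) {
            emitted.add(parsedContent);
        }
    }

    /** Returns the parsed content only when the caller asked for it to be passed back. */
    static Optional<String> handle(String parsedContent, ResultHandling handling,
                                   RecordingEmitter emitter) {
        switch (handling) {
            case EMIT:
                emitter.emit(parsedContent);
                return Optional.empty();
            case RETURN_TO_CALLER:
                return Optional.of(parsedContent);
            default:
                throw new IllegalArgumentException("Unknown handling: " + handling);
        }
    }

    public static void main(String[] args) {
        RecordingEmitter emitter = new RecordingEmitter();
        System.out.println(handle("parsed text", ResultHandling.RETURN_TO_CALLER, emitter));
        System.out.println(handle("parsed text", ResultHandling.EMIT, emitter));
        System.out.println(emitter.emitted);
    }
}
```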
[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814822#comment-17814822 ]

Tim Allison commented on TIKA-4181:
---

The response from the parse should be a PipesResult _and_ it should emit the contents of the parse to the specified emitter, no?
[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805762#comment-17805762 ]

Nicholas DiPiazza commented on TIKA-4181:
-

Tika pipes could get a full-fledged service, perhaps called tika-server-http2, to accompany tika-server and maybe one day replace it?

I'm not sure of the best way to handle packaging the app, but we could create a secondary main method for running tika-pipes as a grpc service.

Then we would create a protobuf contract for each of the new services:
* pipe crud operations - create, update, delete, read, list, etc
* run a pipe job - takes bidirectional streams of data - incoming=fetch metadata objects, outgoing=emitDocuments

We would then provide a Go example and a Java example, generated from our protobuf schema, that people could take and use.
[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805652#comment-17805652 ]

Tim Allison commented on TIKA-4181:
---

Sorry, I shouldn't have responded on the dev list. We should discuss this here.

To confirm, this would be an alternative to tika-server? The response for /pipes would be the pipes/parse status: did the fetch->parse->emit work? The response for /async would be "got the request" or "too many requests".

I have a bunch of time on my hands :P and would be happy to chat via google meet or similar if that would be more efficient.
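The /async behaviour described above, where the reply is either "got the request" or "too many requests", amounts to a bounded intake queue that never blocks the caller. The following is a minimal sketch of that idea only; `AsyncIntakeSketch`, `Ack`, and the capacity are hypothetical and say nothing about how tika-server implements its endpoint.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch of an /async-style intake: accept or refuse, never block. */
final class AsyncIntakeSketch {
    /** The only two replies the caller gets at submission time. */
    enum Ack { ACCEPTED, TOO_MANY_REQUESTS }

    private final BlockingQueue<String> pending;

    AsyncIntakeSketch(int capacity) {
        this.pending = new ArrayBlockingQueue<>(capacity);
    }

    /** Queues the fetch key for later processing if there is room; otherwise refuses it. */
    Ack submit(String fetchKey) {
        // offer() returns false immediately when the queue is full instead of waiting.
        return pending.offer(fetchKey) ? Ack.ACCEPTED : Ack.TOO_MANY_REQUESTS;
    }

    public static void main(String[] args) {
        AsyncIntakeSketch intake = new AsyncIntakeSketch(2);
        System.out.println(intake.submit("doc-1"));
        System.out.println(intake.submit("doc-2"));
        System.out.println(intake.submit("doc-3"));
    }
}
```

By contrast, a /pipes-style call would wait for the fetch, parse, and emit steps to finish and report their status in the reply.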